Creating Powerful Predictors and Choosing What Matters
🧠 Introduction
By now, you've cleaned your dataset and explored it visually. The next step is arguably the most creative and impactful part of a data science project:
➡️ Feature Engineering — creating new input variables that enhance model performance
➡️ Feature Selection — picking the most relevant variables to prevent noise and overfitting
In many real-world cases, great features can outperform even complex algorithms. This chapter teaches both techniques hands-on, from simple transformations to a full selection pipeline.
🧩 1. What is Feature Engineering?
Feature engineering is the process of using domain knowledge to extract features from raw data, making machine learning models more accurate and efficient.
Raw Column | Engineered Feature
Date of birth | Age
Timestamp | Day of week, hour, weekend flag
Transaction log | Time since last transaction
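As a quick illustration of the table above, here is a minimal pandas sketch; the column names (dob, timestamp) and the tiny inline DataFrame are hypothetical, purely for demonstration:
python
import pandas as pd

# Hypothetical data; column names are illustrative only
df = pd.DataFrame({
    'dob': pd.to_datetime(['1990-05-01', '2004-11-23']),
    'timestamp': pd.to_datetime(['2024-01-06 14:30', '2024-01-08 09:15']),
})

# Date of birth -> Age (approximate whole years)
df['Age'] = (pd.Timestamp.now() - df['dob']).dt.days // 365

# Timestamp -> day of week, hour, weekend flag
df['DayOfWeek'] = df['timestamp'].dt.dayofweek   # Monday=0 ... Sunday=6
df['Hour'] = df['timestamp'].dt.hour
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)

# A "time since last transaction" feature would follow the same idea,
# e.g. df.sort_values('txn_time').groupby('user_id')['txn_time'].diff()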
⚙️ 2. Types of Feature Engineering
🔹 2.1 Mathematical Transformations
Useful for reducing skewness, scaling, and normalization.
python
import numpy as np

df['Fare_log'] = np.log1p(df['Fare'])   # log(1 + x) handles zero fares safely
df['Age_squared'] = df['Age'] ** 2
🔹 2.2 Binning (Discretization)
Split continuous variables into groups.
python
df['Age_group'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100],
                         labels=['Teen', 'Adult', 'Senior', 'Elder'])
🔹 2.3 Date-Time Decomposition
Turn date columns into features.
python
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
🔹 2.4 Text Feature Extraction
Length, keyword flagging, or NLP embeddings.
python
df['review_length'] = df['review_text'].str.len()
# na=False keeps missing reviews from breaking the int cast
df['has_discount'] = df['review_text'].str.contains('discount', case=False, na=False).astype(int)
🔹 2.5 Interaction Features
Combine features to discover new relationships.
python
# +1 counts the passenger themselves alongside siblings/spouses and parents/children
df['Fare_per_person'] = df['Fare'] / (df['SibSp'] + df['Parch'] + 1)
🔹 2.6 Group Aggregation Features
Use .groupby() to summarize over categories.
python
df['Avg_Fare_by_Class'] = df.groupby('Pclass')['Fare'].transform('mean')
🎯 3. What is Feature Selection?
Feature selection means choosing the most relevant and informative features for your model — and removing irrelevant, redundant, or noisy ones.
⚠ Why it matters: fewer, better features reduce overfitting, speed up training, and make models easier to interpret.
📊 4. Feature Selection Techniques
🔸 4.1 Filter Methods
Use statistical measures to score features.
Correlation-based Removal (drops one of each highly correlated pair):
python
corr_matrix = df.corr(numeric_only=True).abs()
# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df.drop(columns=to_drop, inplace=True)
Variance Threshold (removes low-variance features):
python
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0.01)
df_reduced = sel.fit_transform(df)  # expects an all-numeric DataFrame
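Another common filter method, not shown above, is univariate scoring. A minimal sketch using scikit-learn's SelectKBest with an ANOVA F-test, assuming a numeric feature matrix X and target y:
python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature independently against the target and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5)
X_top = selector.fit_transform(X, y)
selected = X.columns[selector.get_support()]
print("Top features:", list(selected))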
🔸 4.2 Wrapper Methods
Use a model to test combinations of features.
Recursive Feature Elimination (RFE):
python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X, y)
selected = X.columns[rfe.support_]
print("Selected Features:", selected)
🔸 4.3 Embedded Methods
Feature selection happens inside the model.
Feature Importance with Random Forest:
python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')  # requires matplotlib
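To turn those importances into an actual reduced feature set, one option is scikit-learn's SelectFromModel. A minimal sketch, assuming the same X and y:
python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Keep features whose importance clears the median importance
sfm = SelectFromModel(RandomForestClassifier(random_state=42), threshold='median')
sfm.fit(X, y)
kept = X.columns[sfm.get_support()]
print("Kept features:", list(kept))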
🧪 5. Comparing Models With and Without Feature Selection
Feature Set | Accuracy | Training Time | Overfitting Risk
All features | 0.82 | High | Higher
Top 10 features | 0.83 | Lower | Lower
Top 5 features | 0.81 | Lowest | Lowest
Conclusion: sometimes less is more. A smaller, well-chosen feature set can match the full set's accuracy while training faster and overfitting less.
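The accuracies above are illustrative, but the comparison is easy to run yourself. A minimal sketch using cross-validation, assuming X, y, and the rfe object from the RFE example earlier:
python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)

# Accuracy with all features vs. a reduced set
full_score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
top_features = list(X.columns[rfe.support_])  # e.g. from the RFE step above
reduced_score = cross_val_score(model, X[top_features], y, cv=5, scoring='accuracy').mean()

print(f"All features:     {full_score:.3f}")
print(f"Reduced features: {reduced_score:.3f}")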
📋 6. Summary Table: Engineering Methods
Method | Purpose | Tool
Log transform | Reduce skew | np.log1p()
Binning | Categorize continuous values | pd.cut()
Group mean | Capture category-wise effects | .groupby().transform()
Feature interaction | Combine features | Arithmetic ops
Text extraction | Quantify text fields | .str.len(), .str.contains()
Label/One-hot encode | Convert to numeric | LabelEncoder, get_dummies()
Feature selection | Reduce dimensionality | RFE, RandomForest
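Encoding appears in the table but was not demonstrated earlier. A minimal sketch; the 'Embarked' column is an assumption here, though it does exist in the Titanic dataset used below:
python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Label encoding: map each category to an integer (fine for binary/ordinal columns)
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')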
✅ Full Pipeline Example
python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('titanic.csv')

# Feature engineering
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['Fare_log'] = np.log1p(df['Fare'])

# Encode
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Select candidate features and drop rows with missing values
features = ['Pclass', 'Sex', 'Age', 'Fare_log', 'FamilySize']
X = df[features].dropna()
y = df['Survived'].loc[X.index]

# Feature selection
model = RandomForestClassifier()
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)
print("Selected:", list(X.columns[rfe.support_]))