Craft Smarter Features and Pick What Matters for Better Models

🧠 Introduction
Once you've explored and understood your data, it's time to make it work harder for your model. That's where feature engineering and feature selection come in. Together, these processes help you build more accurate and efficient models.

In this chapter, we'll cover why feature engineering matters, the most common engineering techniques, encoding, and how to select only the features that matter most.
📦 1. Why Feature Engineering Matters

Real-world analogy: if your dataset is a toolbox, then feature engineering is the process of crafting better tools from the raw materials you have.

Benefits: well-crafted features can expose patterns the raw columns hide, improving accuracy, reducing training time, and making results easier to interpret.
🛠 2. Common Feature Engineering Techniques

🔹 2.1 Mathematical Transformations

Transform skewed data to stabilize variance or reduce the effect of outliers.

```python
import numpy as np

df['Log_Fare'] = np.log1p(df['Fare'])  # log(1 + Fare)
df['Sqrt_Age'] = np.sqrt(df['Age'])
```
🔹 2.2 Binning (Discretization)

Convert continuous variables into categories.

```python
import pandas as pd

df['Age_Group'] = pd.cut(df['Age'], bins=[0, 18, 35, 60, 100],
                         labels=['Teen', 'Young', 'Adult', 'Senior'])
```
🔹 2.3 DateTime Features

Extract time-based features from timestamps.

```python
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
```
🔹 2.4 Aggregation (Group Features)

Group and summarize data for new insights.

```python
# Example: mean income by city
df['City_Avg_Income'] = df.groupby('City')['Income'].transform('mean')
```
🔹 2.5 Interaction Features

Multiply or combine features to create new relationships.

```python
df['Age_Income_Ratio'] = df['Income'] / df['Age']
df['Is_Young_Rich'] = ((df['Age'] < 30) & (df['Income'] > 100000)).astype(int)
```
🔹 2.6 Text-Based Features

For text columns, derive features such as length and keyword presence.

```python
df['Review_Length'] = df['Review'].str.len()
# na=False treats missing reviews as "no match" so astype(int) doesn't fail
df['Has_Free_Shipping'] = df['Review'].str.contains('free shipping',
                                                    case=False, na=False).astype(int)
```
📊 3. One-Hot and Label Encoding (Review)

```python
# One-hot encode 'City'
df = pd.get_dummies(df, columns=['City'], drop_first=True)

# Label encode binary column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
```
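Note that scikit-learn's documentation recommends LabelEncoder for target labels (y) rather than input features; for feature columns with an inherent order, OrdinalEncoder is the intended tool. A minimal sketch, assuming a hypothetical 'Size' column and category order:

```python
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder works on 2-D input (a DataFrame slice), unlike LabelEncoder.
# The 'Size' column and its category order are hypothetical examples.
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df[['Size']] = encoder.fit_transform(df[['Size']])
```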
🧹 4. Feature Selection – Why Less is More

Not all features help: some are irrelevant, redundant, or simply noise. Benefits of selection:
| Benefit | Impact |
| --- | --- |
| Reduced overfitting | Prevents model memorization |
| Faster training time | Less data to process |
| Improved accuracy | Focus on signal, not noise |
| Better model interpretability | Easier to explain |
🧮 5. Feature Selection Techniques

🔸 5.1 Variance Threshold (Filter)

Remove features with low variability.

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
df_reduced = selector.fit_transform(df)  # requires all-numeric input
# fit_transform returns a NumPy array; recover the kept column names:
kept_columns = df.columns[selector.get_support()]
```
🔸 5.2 Correlation Matrix

Drop highly correlated features (multicollinearity).

```python
corr_matrix = df.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop one of each pair with correlation > 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df.drop(to_drop, axis=1, inplace=True)
```
🔸 5.3 Recursive Feature Elimination (RFE)

Uses a model to recursively remove the weakest features.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(df.drop('target', axis=1), df['target'])
selected_features = df.drop('target', axis=1).columns[fit.support_]
print("Selected:", selected_features)
```
🔸 5.4 Feature Importance from Trees

Tree models like RandomForest give a score to each feature.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh')
```
📦 6. Feature Selection Pipeline (Step-by-Step)

| Step | Tool/Technique Used |
| --- | --- |
| Remove low variance | VarianceThreshold() |
| Remove high correlation | df.corr() + manual drop |
| Rank by importance | RandomForestClassifier().feature_importances_ |
| Use wrapper method | RFE, SelectKBest, or SHAP |
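To see how these steps chain together, here is a minimal sketch of the full pipeline, assuming X is a hypothetical all-numeric feature DataFrame and y its target; SelectKBest stands in for the final selection step from the table:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# X: numeric feature DataFrame, y: target Series (hypothetical placeholders)

# Step 1: remove low-variance features
vt = VarianceThreshold(threshold=0.1)
X_vt = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])

# Step 2: remove one of each highly correlated pair
corr = X_vt.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
X_vt = X_vt.drop(columns=[c for c in upper.columns if any(upper[c] > 0.9)])

# Step 3: rank remaining features by tree-based importance
rf = RandomForestClassifier().fit(X_vt, y)
ranking = pd.Series(rf.feature_importances_, index=X_vt.columns).sort_values(ascending=False)

# Step 4: keep the k best features by a univariate statistical test
kb = SelectKBest(score_func=f_classif, k=5)
X_final = X_vt[X_vt.columns[kb.fit(X_vt, y).get_support()]]
```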
📋 7. Summary Table: Engineering Techniques

| Technique | Use Case | Example |
| --- | --- | --- |
| Math transformation | Reduce skew, scale features | log(), sqrt() |
| Binning | Convert numerical → categorical | Age groups |
| Date decomposition | Create seasonal trends | .dt.month, .dt.dayofweek |
| Group aggregation | Add context by grouping | Avg income by city |
| Feature interaction | Capture relationships | Income/Age ratio |
| Text features | Keyword presence, length, etc. | str.contains(), .str.len() |
| Encoding | Make categories numeric | get_dummies(), LabelEncoder |
✅ Final Code Snippet: Mini Pipeline

```python
# Example feature engineering pipeline
df['Log_Fare'] = np.log1p(df['Fare'])
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# One-hot encode Title
df = pd.get_dummies(df, columns=['Title'], drop_first=True)

# Drop unused features
df.drop(['Ticket', 'Cabin', 'Name'], axis=1, inplace=True)
```