Introduction
Preprocessing and feature engineering are among the most crucial phases in the machine learning workflow. Models are only as good as the data they receive, and that data must be clean, consistent, informative, and properly transformed. Poorly prepared data leads to poor generalization, no matter how advanced your algorithm.

In this chapter, we'll dive deep into how to preprocess data and engineer features using Scikit-Learn. We'll also explore how to build robust and reusable pipelines to automate your ML workflow, ensuring consistency, reducing errors, and enabling efficient deployment.
1. Understanding Data Preprocessing
What is Preprocessing?

Preprocessing transforms raw input data into a clean, standardized format that models can interpret. It typically involves dealing with missing values, categorical variables, inconsistent feature scales, and outliers.

Why It Matters: mistakes at this stage propagate through the entire workflow; a model trained on inconsistent or leaky data will not generalize, no matter how well it is tuned.
2. Handling Missing Data
Scikit-Learn offers SimpleImputer and KNNImputer for replacing missing values.

Example Using SimpleImputer:

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_clean = imputer.fit_transform(X)
```
Table: Imputation Strategies

| Strategy | Use Case |
| --- | --- |
| Mean | Normally distributed features |
| Median | Skewed data |
| Most frequent | Categorical variables |
| Constant | Special value like 0 or "none" |
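KNNImputer, mentioned above, fills each gap from the most similar rows rather than from a column-wide statistic. A minimal sketch with toy data (the array values here are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing value is replaced by the mean of that feature
# across the n_neighbors most similar complete rows.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The NaN in row 1 becomes the mean of 1.0 and 7.0, i.e. 4.0
```

Unlike SimpleImputer, this uses relationships between rows, which can help when features are correlated.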
3. Encoding Categorical Variables
Why Encode?
Most ML algorithms work only with numbers. Categorical
features need to be encoded into numeric form.
Scikit-Learn Options:

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older sparse argument in scikit-learn >= 1.2
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X[['gender']])
```
Mapping Categories Manually

For ordinal variables:

```python
size_map = {'Small': 1, 'Medium': 2, 'Large': 3}
df['size'] = df['size'].map(size_map)
```
Table: Encoding Techniques

| Method | Use Case | Scikit-Learn Class |
| --- | --- | --- |
| Label Encoding | Ordinal, small categories | OrdinalEncoder |
| One-Hot Encoding | Nominal, multiple classes | OneHotEncoder |
| Binary Encoding | High cardinality features | Use category_encoders |
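The OrdinalEncoder listed in the table can replace the manual dictionary mapping shown earlier, once you pass an explicit category order. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit order so Small < Medium < Large maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

X = np.array([['Medium'], ['Small'], ['Large']])
X_encoded = encoder.fit_transform(X)
# -> [[1.], [0.], [2.]]
```

The advantage over a hand-written map is that the encoder plugs directly into a ColumnTransformer or Pipeline.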
4. Feature Scaling
Algorithms like KNN, SVM, and gradient-descent-based models benefit from feature scaling, which brings all features to the same scale so that no single feature dominates distance or loss computations.
Scikit-Learn Options:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
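StandardScaler is not the only option: MinMaxScaler rescales each feature to the [0, 1] range instead of zero mean and unit variance. A toy comparison (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# StandardScaler: zero mean, unit variance.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: rescales each feature to [0, 1].
X_mm = MinMaxScaler().fit_transform(X)
```

MinMaxScaler is often preferred when the downstream model expects bounded inputs; StandardScaler is the usual default for distance-based and gradient-based methods.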
5. Feature Engineering
Feature engineering involves creating new features from
existing data to improve model performance.
Techniques:

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```
Table: Feature Engineering Techniques

| Technique | Purpose |
| --- | --- |
| Binning | Convert continuous to categories |
| Polynomial features | Model non-linear relationships |
| Feature decomposition | Break compound features |
| Log/Box-Cox transforms | Normalize skewed distributions |
| Aggregated features | Summarize groups or time windows |
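The log transform from the table can be wrapped as a Scikit-Learn step with FunctionTransformer, so it fits into a pipeline like any other transformer. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.log1p computes log(1 + x), which handles zeros safely.
log_transform = FunctionTransformer(np.log1p)

X = np.array([[0.0], [9.0], [99.0]])
X_log = log_transform.fit_transform(X)
```

Because FunctionTransformer is stateless here, fit does nothing; the payoff is that the transform participates in pipelines and cross-validation.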
6. ColumnTransformer: Targeted Preprocessing
When you need to apply different transformations to
different feature types, ColumnTransformer comes in handy.
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = ['age', 'salary']
cat_features = ['gender', 'region']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), cat_features)
])
```
7. Building Reusable Pipelines
Scikit-Learn's Pipeline class allows you to encapsulate an
entire workflow: preprocessing + modeling. This reduces data leakage and makes
the process reproducible.
Full Example:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

clf_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf_pipeline.fit(X_train, y_train)
```
Benefits of Using Pipelines: every transformer is fit only on the training data, preventing leakage; the whole workflow can be cross-validated and tuned as a single estimator; and the same object can be saved and reused in production.
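Because a pipeline behaves like a single estimator, it can be tuned end to end with GridSearchCV. A minimal sketch on synthetic data (the step names and parameter grid here are illustrative, not from the chapter's example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Grid keys use '<step name>__<parameter>' to reach inside the pipeline.
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
```

Note that the scaler is refit inside every cross-validation fold, which is exactly the leakage protection described above.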
8. Saving and Reloading Pipelines
You can save your full pipeline for reuse in production
using joblib.
```python
import joblib

joblib.dump(clf_pipeline, 'model_pipeline.pkl')

# Later
clf_loaded = joblib.load('model_pipeline.pkl')
```
Summary Table: Scikit-Learn Preprocessing Toolkit

| Task | Tool/Class |
| --- | --- |
| Missing values | SimpleImputer, KNNImputer |
| Scaling | StandardScaler, MinMaxScaler |
| Encoding categoricals | OneHotEncoder, OrdinalEncoder |
| Creating features | PolynomialFeatures, FunctionTransformer |
| Targeted transforms | ColumnTransformer |
| Pipeline management | Pipeline |
| Model persistence | joblib, pickle |
Conclusion

Preprocessing and feature engineering are the foundation of any successful ML model. Scikit-Learn provides one of the most elegant and modular systems for handling this part of the workflow, especially when combined with Pipeline and ColumnTransformer. By embracing a systematic approach, you get consistent transformations, fewer leakage bugs, and workflows that are easy to reproduce and deploy.

In the next chapter, we'll explore model selection, evaluation, and hyperparameter tuning, the stage where data meets algorithm.
FAQs

What does an end-to-end machine learning project include?
An end-to-end machine learning project includes all stages of development, from defining the problem and gathering data to training, evaluating, and deploying the model in a real-world environment.

Why is Scikit-Learn so widely used?
Scikit-Learn is widely adopted due to its simplicity, clean API, and comprehensive set of tools for data preprocessing, modeling, evaluation, and tuning, making it ideal for full ML workflows.

Can Scikit-Learn be used for deep learning?
Scikit-Learn is not designed for deep learning. For such use cases, you should use frameworks like TensorFlow or PyTorch. However, Scikit-Learn is perfect for classical ML tasks like classification, regression, and clustering.

How do I handle missing values?
You can use SimpleImputer from sklearn.impute to fill in missing values with mean, median, or most frequent values as part of a pipeline.

Why use pipelines?
Pipelines help you bundle preprocessing and modeling steps together, ensuring consistency during training and testing and reducing the chance of data leakage.

How should I evaluate a model?
You should split your data into training and test sets or use cross-validation to assess performance. Scikit-Learn offers metrics like accuracy, F1-score, RMSE, and R² depending on the task.

Can Scikit-Learn models be deployed to production?
Yes, models trained with Scikit-Learn can be serialized using joblib or pickle and deployed using tools like Flask, FastAPI, or cloud services such as AWS and Google Cloud.

What is cross-validation?
Cross-validation is a method of splitting the data into multiple folds to ensure the model generalizes well. It helps detect overfitting and gives a more reliable performance estimate.

How do I tune hyperparameters?
You can use GridSearchCV or RandomizedSearchCV to automate hyperparameter tuning and select the best model configuration based on performance metrics.

Can Scikit-Learn handle categorical features?
Yes, using transformers like OneHotEncoder or OrdinalEncoder, and integrating them within a ColumnTransformer, Scikit-Learn can preprocess both categorical and numerical features efficiently.
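Cross-validation, discussed above, can be run in a single call with cross_val_score. A minimal sketch on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the model is trained and scored five times,
# each time holding out a different fifth of the data.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
```

Passing a full Pipeline as the estimator here extends the same idea to preprocessing-plus-model workflows.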
 