Prepare Your Numerical Data for Optimal Performance in Machine Learning
🧠 Introduction
Machine learning models are sensitive to the scale of data. Features with different ranges (like income in thousands and age in tens) can confuse models, slow down training, or even lead to inaccurate predictions. That's why scaling and normalization are critical steps in preprocessing.

In this chapter, you'll learn:

- What scaling and normalization are, and why they matter
- How to use StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler, and Normalizer
- How to pick the right scaler for your data
- How to integrate scaling into Scikit-learn pipelines without data leakage
🔍 What is Scaling?
Scaling changes the range of numerical values so that
different features become comparable. It prevents one feature from dominating
others simply due to its magnitude.
🔄 What is Normalization?
Normalization usually refers to rescaling the
values to a [0, 1] range (also known as Min-Max Scaling). However,
sometimes the term is also used interchangeably with feature scaling in
general.
📦 Why Is Scaling Important?

| Problem | Caused By | Impact |
|---|---|---|
| Features on different scales | Age (1–100) vs Income (10K–100K) | Bias in distance-based models (e.g., KNN, SVM) |
| Slow convergence in gradient descent | Large input feature values | Model training becomes inefficient |
| Incorrect feature importance | Larger values appear more "important" | Misleading feature ranking |
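To see the first problem in action, here is a minimal sketch (the two sample points are chosen for illustration, not taken from the dataset below) showing how Euclidean distance is dominated by the large-magnitude Income feature until the data is standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative people: a large age gap, a small income gap
a = np.array([25, 50000])   # [Age, Income]
b = np.array([50, 51000])

# Raw distance is about 1000.3, driven almost entirely by Income;
# the 25-year age gap barely registers
print(np.linalg.norm(a - b))

# After standardization, both features contribute on comparable terms
X = np.array([[25, 50000], [50, 51000], [35, 60000]], dtype=float)
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))
```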
📊 Step 1: Sample Dataset

```python
import pandas as pd

# Toy dataset: Age spans tens, Income spans tens of thousands
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 80000, 60000, 120000, 40000]
})
```
⚙️ Step 2: Standardization with StandardScaler

Standardization transforms features to have a mean (μ) of 0 and a standard deviation (σ) of 1.

▶ Formula:
z = (x − μ) / σ

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit to the data and transform it in one step (returns a NumPy array)
df_scaled = scaler.fit_transform(df)
df_standardized = pd.DataFrame(df_scaled, columns=df.columns)
```
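As a quick sanity check, each column should now have a mean of approximately 0 and a population standard deviation of 1 (note that StandardScaler uses ddof=0, unlike Pandas' default ddof=1). A short sketch continuing from the snippet above:

```python
# Means should be ~0 and population standard deviations ~1
print(df_standardized.mean().round(6))
print(df_standardized.std(ddof=0).round(6))
```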
📉 Step 3: Min-Max Normalization with MinMaxScaler

Min-Max Scaling rescales values to a [0, 1] range.

▶ Formula:
x_scaled = (x − x_min) / (x_max − x_min)

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Each column is independently rescaled to the [0, 1] range
df_minmax = scaler.fit_transform(df)
df_normalized = pd.DataFrame(df_minmax, columns=df.columns)
```
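A quick check, continuing from the snippet above, confirms each column now spans exactly [0, 1]:

```python
# Per-column minimums should be 0.0 and maximums 1.0
print(df_normalized.min())
print(df_normalized.max())
```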
🧪 Step 4: Robust Scaling with RobustScaler

RobustScaler uses the median and interquartile range (IQR) instead of the mean and standard deviation, which makes it resistant to outliers.

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

# Centers each column on its median and scales by its IQR (75th − 25th percentile)
df_robust = scaler.fit_transform(df)
df_robust_scaled = pd.DataFrame(df_robust, columns=df.columns)
```
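To make the behavior concrete, here is a sketch reproducing RobustScaler's output by hand, assuming its default quantile range of (25, 75):

```python
# Manual robust scaling: (x − median) / IQR, computed per column
median = df.median()
iqr = df.quantile(0.75) - df.quantile(0.25)
manual_robust = (df - median) / iqr

# Should match df_robust_scaled up to floating-point noise
print(manual_robust)
```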
🧮 Step 5: MaxAbs Scaling with MaxAbsScaler

MaxAbsScaler scales features to the [-1, 1] range by dividing each value by the column's maximum absolute value.

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()

# Divides each column by its maximum absolute value
df_maxabs = scaler.fit_transform(df)
df_maxabs_scaled = pd.DataFrame(df_maxabs, columns=df.columns)
```
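Because the transformation is just a per-column division, it is easy to verify by hand (continuing from the snippet above):

```python
# Manual MaxAbs scaling: divide each column by its maximum absolute value
manual_maxabs = df / df.abs().max()

# Should match df_maxabs_scaled exactly
print(manual_maxabs)
```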
✨ Step 6: Normalizing a Single Row (Unit Vector)

Use Normalizer when you need to transform rows into unit vectors (with the default L2 norm, each row's sum of squares equals 1). This is useful for text classification with TF-IDF vectors.

```python
from sklearn.preprocessing import Normalizer

scaler = Normalizer()  # norm='l2' by default

# Rescales each ROW (not each column) to unit length
df_normalized_rows = scaler.fit_transform(df)
df_norm_row = pd.DataFrame(df_normalized_rows, columns=df.columns)
```
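To confirm every row is now a unit vector, check the L2 norms row-wise (a short sketch using NumPy, continuing from the snippet above):

```python
import numpy as np

# Each row's L2 norm should be 1.0
print(np.linalg.norm(df_norm_row.values, axis=1))
```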
📊 Summary Table: Scaling Methods

| Scaler | Use Case | Handles Outliers? | Output Range |
|---|---|---|---|
| StandardScaler | Most ML models (e.g., SVM, Logistic Regression) | ❌ | Mean=0, Std=1 |
| MinMaxScaler | Neural networks, KNN | ❌ | [0, 1] |
| RobustScaler | Data with many outliers | ✅ | Centered by median |
| MaxAbsScaler | Sparse datasets | ❌ | [-1, 1] |
| Normalizer | Normalize rows (not columns) | ❌ | Unit norm (L2=1) |
🧠 Step 7: When to Use Which Scaler?

| Scenario | Best Scaler |
|---|---|
| You have outliers | RobustScaler |
| You need values between 0 and 1 | MinMaxScaler |
| Most models like Logistic Regression | StandardScaler |
| Sparse data (many 0s) | MaxAbsScaler |
| Text vectorization (TF-IDF, L2 norm) | Normalizer |
🛠 Step 8: Scaling in a Machine Learning Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Scaling is a pipeline step, so its statistics come only from the data
# the pipeline is fit on
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', RandomForestRegressor())
])
```
Integrating scaling into pipelines ensures no data
leakage between training and testing.
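A minimal usage sketch: the target values y here are hypothetical, invented purely so the pipeline can be fit on the sample dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical regression targets, one per row of df (illustrative only)
y = [1.0, 2.5, 1.8, 3.2, 0.9]

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.4, random_state=42
)

# fit() learns scaling statistics from the training split only;
# predict() reuses those same statistics on the test split
pipe.fit(X_train, y_train)
print(pipe.predict(X_test))
```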
🧪 Step 9: Apply to Selected Columns Only

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

numeric_features = ['Age', 'Income']

# Scale only the listed numeric columns (other columns are dropped by
# default; pass remainder='passthrough' to keep them unchanged)
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features)
])

model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('regressor', LinearRegression())
])
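```

As in Step 8, the combined pipeline is fit in a single call (a sketch reusing the illustrative train/test split from above):

```python
# Fit preprocessing and regression together on the training portion
model.fit(X_train, y_train)
print(model.predict(X_test))
```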
🚫 Common Mistakes and Fixes

| Mistake | Fix |
|---|---|
| Scaling test set with different stats | Always use transform() after fit() |
| Scaling categorical columns | Apply scaling only to numeric features |
| Applying scaler before train-test split | Always split data before scaling |
| Using .fit_transform() on both sets | Use .fit() on training, .transform() on test set |
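The last two mistakes boil down to one pattern, sketched here with illustrative variable names:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split FIRST, so test-set statistics never leak into training
train_df, test_df = train_test_split(df, test_size=0.4, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)  # fit on training data only
test_scaled = scaler.transform(test_df)        # reuse the training mean/std
```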
📉 Before and After Example

▶ Original

| Age | Income |
|---|---|
| 25 | 50000 |
| 45 | 80000 |
| 35 | 60000 |

▶ After Min-Max Scaling

| Age | Income |
|---|---|
| 0.0 | 0.0 |
| 1.0 | 1.0 |
| 0.5 | 0.333... |
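These numbers can be reproduced directly; note this sketch fits the scaler on just these three rows, not the full five-row dataset:

```python
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

subset = pd.DataFrame({'Age': [25, 45, 35], 'Income': [50000, 80000, 60000]})

# Age: (35 − 25) / (45 − 25) = 0.5
# Income: (60000 − 50000) / (80000 − 50000) ≈ 0.333
print(MinMaxScaler().fit_transform(subset))
```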
🧠 Best Practices

- Split your data before scaling, and fit the scaler on the training set only.
- Put scalers inside a Pipeline (with ColumnTransformer for mixed data) so the same transformation is applied consistently and without leakage.
- Scale only numeric features; leave categorical columns to encoders.
- Match the scaler to the data: RobustScaler for outliers, MaxAbsScaler for sparse data, Normalizer for row-wise normalization such as TF-IDF vectors.
🏁 Conclusion
Scaling and normalization are more than just a preprocessing
step — they’re essential for model reliability and performance. Without
them, even the most powerful algorithms can behave poorly. Whether you're
training a neural network or clustering customer data, make sure your numeric
features speak the same scale.
With Scikit-learn’s tools and a clear understanding of each
technique, you can scale your data confidently and efficiently — and get one
step closer to cleaner, smarter machine learning.