Model Evaluation Techniques in ML


📕 Chapter 4: Evaluation in Imbalanced and Noisy Datasets

🎯 Objective

This chapter focuses on evaluating machine learning models in challenging conditions — specifically imbalanced datasets (where one class dominates) and noisy datasets (with corrupted or mislabeled data). These scenarios are common in real-world applications like fraud detection, medical diagnoses, and anomaly detection.

Standard evaluation metrics like accuracy fail in these cases, so this chapter explores robust strategies and metrics designed for imbalanced and noisy data.


🔍 Understanding Imbalanced Datasets

In an imbalanced dataset, the majority class heavily outweighs the minority class. A model that predicts everything as the majority class could still achieve high accuracy — but be useless.

Example: In a dataset where only 1% of transactions are fraud, a model predicting “not fraud” for everything will be 99% accurate — but completely ineffective.
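
A minimal sketch of this accuracy trap, using synthetic data and scikit-learn's DummyClassifier as the always-majority baseline:

```python
# Hypothetical 1%-fraud dataset: a majority-class baseline scores ~99% accuracy
# while catching zero fraud cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives ("fraud")

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", accuracy_score(y, pred))   # ~0.99
print("recall  :", recall_score(y, pred))     # 0.0 -- no fraud is ever caught
```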


⚠️ Problems with Standard Accuracy

  • Hides poor minority class performance
  • Encourages bias toward majority class
  • Doesn’t differentiate error severity

Better Metrics for Imbalanced Datasets


1. Precision, Recall, and F1 Score

  • Precision evaluates how many predicted positives are actually correct.
  • Recall measures how many actual positives the model found.
  • F1 Score balances the two.

These metrics help in fraud detection, disease prediction, and rare event classification.
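
A short sketch with hand-made labels, just to show how the three metrics are computed in scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]   # 4 actual positives
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]   # 4 predicted positives, 3 of them correct

print("precision:", precision_score(y_true, y_pred))   # 3/4 = 0.75
print("recall   :", recall_score(y_true, y_pred))       # 3/4 = 0.75
print("F1       :", f1_score(y_true, y_pred))           # 0.75
```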


2. Confusion Matrix Insights

Confusion matrices become even more valuable for imbalanced data. Focus on:

  • False negatives: Missed actual positives
  • False positives: Incorrect alerts
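
A minimal sketch of pulling these four counts out of scikit-learn's confusion matrix (same hand-made labels as above):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")   # TN=5  FP=1  FN=1  TP=3
```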

3. Precision-Recall (PR) Curve

Often more informative than the ROC curve in imbalanced settings, the PR curve plots precision against recall, showing performance across various classification thresholds.
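
A sketch of building a PR curve from predicted probabilities, using a synthetic imbalanced dataset and logistic regression purely as a stand-in model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Probability scores for the positive class from a stand-in model.
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)
print("PR AUC (average precision):", average_precision_score(y_te, scores))
```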


4. ROC Curve and AUC

While the ROC curve still works, it can be misleading in imbalanced data. The AUC should be interpreted with caution — always compare it with PR AUC.
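
A tiny hand-made illustration of why the two should be read together: on a skewed sample the ROC AUC can look strong while the PR AUC (average precision) is noticeably lower.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hand-made skewed example: 2 positives among 20 samples.
y_true = [0] * 18 + [1, 1]
scores = [0.1] * 14 + [0.4] * 4 + [0.35, 0.9]

print("ROC AUC:", roc_auc_score(y_true, scores))            # about 0.89
print("PR AUC :", average_precision_score(y_true, scores))  # about 0.67
```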


5. G-Mean and Balanced Accuracy

These metrics consider the balance between sensitivity (recall for positive class) and specificity (recall for negative class).
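
A minimal sketch, assuming imbalanced-learn is installed for the G-mean helper:

```python
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import geometric_mean_score   # from imbalanced-learn

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.6875
print("G-mean           :", geometric_mean_score(y_true, y_pred))     # ~0.66
```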



🧪 Sampling Techniques


Oversampling the Minority Class

SMOTE (Synthetic Minority Oversampling Technique) creates synthetic examples of the minority class. This boosts recall but can cause overfitting.
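
A minimal SMOTE sketch with imbalanced-learn (assumed installed); note that resampling should be applied to the training split only:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))    # classes roughly balanced
```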


Undersampling the Majority Class

Removes samples from the majority class to rebalance the dataset. It helps with training speed but may discard valuable data.
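
The equivalent undersampling sketch, again assuming imbalanced-learn:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # majority class shrunk to match the minority
```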


Combined Sampling

Uses both over- and under-sampling to balance class distribution.
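
A combined-sampling sketch using imbalanced-learn's SMOTETomek (SMOTE oversampling followed by Tomek-link cleaning):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # oversampled minority, cleaned class overlap
```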


🤖 Ensemble Methods for Imbalanced Data

  • Random Forest with Class Weights: Set class weights (e.g. class_weight="balanced") so that minority-class errors are penalized more heavily.
  • XGBoost: Handles class imbalance through built-in hyperparameters such as scale_pos_weight.
  • BalancedBaggingClassifier: Bagging that re-balances each bootstrap sample before fitting.
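
A minimal sketch of the three options above, assuming xgboost and imbalanced-learn are installed (hyperparameter values are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from xgboost import XGBClassifier

# 1) Random forest: "balanced" weights penalize minority-class errors more heavily.
rf = RandomForestClassifier(class_weight="balanced", random_state=0)

# 2) XGBoost: scale_pos_weight is often set to (negative count / positive count).
xgb = XGBClassifier(scale_pos_weight=19.0, eval_metric="logloss")

# 3) Balanced bagging: each bootstrap sample is re-balanced before a base tree is fit.
bbc = BalancedBaggingClassifier(random_state=0)

# Each estimator is then used as usual: clf.fit(X_train, y_train), clf.predict(X_test).
```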

🧠 Evaluating Noisy Datasets

Noise refers to irrelevant, mislabeled, or inconsistent data. Label noise is especially harmful in supervised learning.


Types of Noise

  • Attribute noise: Incorrect or distorted input features
  • Label noise: Incorrect class assignments

🧼 Strategies to Handle Noisy Data


1. Robust Evaluation Metrics

Use metrics that are less sensitive to outliers, such as the mean absolute error (MAE) or the median absolute error in regression.
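
A small worked example showing how one wild outlier inflates RMSE far more than MAE or the median absolute error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, median_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 12.5, 10.5, 12.0, 40.0])   # last prediction is a wild outlier

print("MAE  :", mean_absolute_error(y_true, y_pred))          # 6.2
print("MedAE:", median_absolute_error(y_true, y_pred))         # 1.0
print("RMSE :", np.sqrt(mean_squared_error(y_true, y_pred)))   # ~12.5
```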


2. Noise Detection and Filtering

Apply noise filtering methods like:

  • k-NN label filtering (sketched after this list)
  • Consensus voting from ensemble models
  • Rule-based filters
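
A minimal sketch of k-NN label filtering, written as an illustrative heuristic rather than a call into a specific library: each sample is kept only if its label agrees with the majority label of its k nearest neighbours (binary 0/1 labels assumed).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_filter(X, y, k=5):
    """Return a boolean mask: True where a sample's label matches its neighbourhood."""
    y = np.asarray(y)
    # k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]                               # drop the self-neighbour
    majority = (neighbour_labels.mean(axis=1) >= 0.5).astype(int)  # assumes 0/1 labels
    return majority == y

# Usage: drop the flagged samples before training.
# mask = knn_label_filter(X_train, y_train, k=5)
# X_clean, y_clean = X_train[mask], y_train[mask]
```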

3. Data Cleaning with Domain Knowledge

Leverage expert input to flag or remove suspicious records, especially in high-stakes fields like healthcare.


4. Use of Robust Models

Algorithms like Random Forest, Gradient Boosting, or RANSAC Regression are more resilient to noise.
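
A small sketch contrasting ordinary least squares with RANSAC on synthetic data where a handful of targets are corrupted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=200)
y[-15:] = 0.0                                   # corrupted / mislabeled targets

ols = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

print("OLS slope   :", ols.coef_[0])                # dragged well below the true value of 3
print("RANSAC slope:", ransac.estimator_.coef_[0])  # stays close to 3
```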


📊 Comparison Table Summary

| Technique       | Use Case                  | Strengths                                  | Limitations                        |
|-----------------|---------------------------|--------------------------------------------|------------------------------------|
| PR Curve        | Imbalanced classification | Highlights positive class performance      | Less intuitive for non-specialists |
| SMOTE           | Minority oversampling     | Boosts recall                              | Risk of overfitting                |
| ROC AUC         | General performance       | Widely used                                | Can be misleading on skewed data   |
| Noise Filtering | Noisy/mislabeled datasets | Improves model quality                     | May remove rare edge cases         |
| G-Mean          | Balanced evaluation       | Considers both sensitivity and specificity | Harder to interpret than F1        |


Tips and Best Practices


  • Use stratified sampling during cross-validation to preserve class distribution (see the sketch after this list)
  • Always compare ROC AUC with PR AUC in imbalanced classification
  • Prefer F1 score over accuracy when classes are imbalanced
  • Use log transformations to minimize noise in skewed numeric data
  • Visualize decision boundaries and residuals to detect noise and misclassification
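
A minimal stratified cross-validation sketch, scoring with F1 as suggested above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")
print("F1 per fold:", scores)   # class ratio is preserved in every fold
```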


FAQs


1. Why is model evaluation important in machine learning?

Model evaluation ensures that your model not only performs well on training data but also generalizes effectively to new, unseen data. It helps prevent overfitting and guides model selection.

2. What is the difference between training accuracy and test accuracy?

Training accuracy measures performance on the data used to train the model, while test accuracy evaluates how well the model generalizes to new data. High training accuracy but low test accuracy often indicates overfitting.

3. What is the purpose of a confusion matrix?

A confusion matrix summarizes prediction results for classification tasks. It breaks down true positives, true negatives, false positives, and false negatives, allowing detailed error analysis.

4. When should I use the F1 score over accuracy?

 Use the F1 score when dealing with imbalanced datasets, where accuracy can be misleading. The F1 score balances precision and recall, offering a better sense of performance in such cases.

5. How does cross-validation improve model evaluation?

Cross-validation reduces variance in model evaluation by testing the model on multiple folds of the dataset. It provides a more reliable estimate of model performance than a single train/test split.

6. What is the ROC AUC score?

ROC AUC measures the model’s ability to distinguish between classes across different thresholds. A score closer to 1 indicates excellent discrimination, while 0.5 implies random guessing.

7. What’s the difference between MAE and RMSE in regression?

MAE calculates the average absolute errors, treating all errors equally. RMSE squares the errors, giving more weight to larger errors. RMSE is more sensitive to outliers.

8. Why is adjusted R² better than regular R²?

Adjusted R² accounts for the number of predictors in a model, making it more reliable when comparing models with different numbers of features. It penalizes unnecessary complexity.
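
A tiny worked example using the standard adjustment formula, 1 - (1 - R^2) * (n - 1) / (n - p - 1):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n samples and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.85, n=100, p=5))    # ~0.842
print(adjusted_r2(0.85, n=100, p=40))   # ~0.748 -- extra predictors are penalized
```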

9. What’s a good silhouette score?

A silhouette score close to 1 indicates well-separated clusters in unsupervised learning. Scores near 0 suggest overlapping clusters, and negative values imply poor clustering.
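
A quick sketch comparing the silhouette score of a sensible clustering against random labels on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
random_labels = np.random.default_rng(0).integers(0, 3, size=len(X))

print("k-means labels:", silhouette_score(X, kmeans_labels))   # high: well-separated clusters
print("random labels :", silhouette_score(X, random_labels))   # near zero: overlapping clusters
```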

10. Can model evaluation metrics vary between domains?

Yes, different problems require different metrics. For example, in medical diagnosis, recall might be more critical than accuracy, while in financial forecasting, minimizing RMSE may be preferred.