Lesson 9.2: Handling Imbalanced Data – SMOTE, Undersampling/Oversampling
🔹 What is Imbalanced Data?
Imbalanced data occurs when the classes in a dataset are not equally represented.
- Example: Fraud detection → 1% fraudulent, 99% non-fraudulent.
- This can cause models to favor the majority class, reducing predictive performance on the minority class.
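To see why favoring the majority class is a problem, here is a small sketch using the 1% / 99% fraud split from the example above. The labels and the always-predict-majority "model" are made up for illustration; they are not from a real dataset.

```python
# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent (1% vs. 99%).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraction of actual frauds that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")    # accuracy = 0.99
print(f"fraud recall = {recall:.2f}")  # fraud recall = 0.00
```

The accuracy looks excellent, yet not a single fraudulent case is detected, which is why accuracy alone can be misleading on imbalanced data.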
🔹 Techniques to Handle Imbalanced Data
- Oversampling
  - Increase the number of minority class samples.
  - Example: Randomly duplicate minority class samples.
- Undersampling
  - Reduce the number of majority class samples.
  - Example: Randomly remove samples from the majority class.
- SMOTE (Synthetic Minority Over-sampling Technique)
  - Generates synthetic minority class samples by interpolating between a sample and its nearest minority class neighbors.
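Random oversampling and random undersampling can be sketched in a few lines of plain Python. The function names `random_oversample` and `random_undersample`, the `seed` parameter, and the toy data are assumptions made for this illustration, not part of any library.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_res.append(rng.choice(pool))  # exact copy of an existing sample
            y_res.append(label)
    return X_res, y_res

def random_undersample(X, y, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_res, y_res = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):  # sampling without replacement
            X_res.append(x)
            y_res.append(label)
    return X_res, y_res

# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # both classes now have 8 samples
print(Counter(y_under))  # both classes now have 2 samples
```

Oversampling keeps all original data but adds exact copies, while undersampling discards six of the eight majority samples in this toy case.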
🔹 Example (Using SMOTE)
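The following is a minimal pure-Python sketch of the idea behind SMOTE for a two-class problem, not a reference implementation. The function name `smote_sketch`, its parameters, and the toy data are assumptions for illustration. In practice, the imbalanced-learn library offers a `SMOTE` class with a `fit_resample(X, y)` method that is typically used instead.

```python
import math
import random

def smote_sketch(X, y, minority_label, k=3, seed=0):
    """Add synthetic minority samples until both classes have the same size.

    Assumes two classes and at least two minority samples.
    """
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    X_res, y_res = [list(x) for x in X], list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Interpolate at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic = [b + gap * (n - b) for b, n in zip(base, neighbor)]
        X_res.append(synthetic)
        y_res.append(minority_label)
    return X_res, y_res

# Toy data: 6 majority samples (label 0) and 3 minority samples (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = smote_sketch(X, y, minority_label=1, k=2)
print(y_res.count(0), y_res.count(1))  # 6 6
```

Each synthetic point lies on a line segment between two existing minority samples, so it is a new point rather than an exact copy.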
- X_res and y_res → Balanced dataset
- Works well for imbalanced classification problems
🔹 Advantages
- Helps prevent the model from being biased toward the majority class.
- SMOTE generates new synthetic examples rather than duplicating existing ones.
🔹 Disadvantages
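- Oversampling → Can overfit the minority class, because duplicated samples are repeated exactly.
- Undersampling → May lose important information from the discarded majority samples.
- SMOTE → May create noisy samples if not applied carefully, for example when synthetic points fall in regions that overlap the majority class.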
✅ Quick Recap:
- Imbalanced data → Classes not equally represented.
- Handle it using Oversampling, Undersampling, or SMOTE to reduce bias toward the majority class and improve performance on the minority class.
