Lesson 9.2: Handling Imbalanced Data – SMOTE, Undersampling/Oversampling - Raman Digital Institute

Lesson 9.2: Handling Imbalanced Data – SMOTE, Undersampling/Oversampling

🔹 What is Imbalanced Data?

Imbalanced data occurs when the classes in a dataset are not equally represented.

Example: Fraud detection → 1% fraudulent, 99% non-fraudulent.
A model trained on such data tends to favor the majority class: predicting "non-fraudulent" every time already scores 99% accuracy while detecting no fraud at all.
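The fraud example can be checked with a few lines of plain Python. This is a minimal sketch with made-up labels (10 fraudulent and 990 non-fraudulent records, chosen only to match the 1% / 99% split); the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 1 = fraudulent, 0 = non-fraudulent (1% vs 99%).
y_true = [1] * 10 + [0] * 990

# A stand-in "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: frauds caught / all frauds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy     = {accuracy:.2f}")  # 0.99
print(f"fraud recall = {recall:.2f}")    # 0.00
```

High accuracy with zero recall on the minority class is the reason accuracy alone is a poor guide on imbalanced data.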

🔹 Techniques to Handle Imbalanced Data

Oversampling

Increase the number of minority class samples.
Example: Random duplication of minority class data.
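Random duplication can be sketched with the standard library alone. The helper name `random_oversample` and the toy data are assumptions made for illustration; in practice a library class such as imbalanced-learn's `RandomOverSampler` would normally be used instead.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Add (target - count) duplicates, drawn with replacement.
        for _ in range(target - count):
            X_res.append(rng.choice(rows))
            y_res.append(label)
    return X_res, y_res

# Toy data (hypothetical): 8 majority rows, 2 minority rows.
X = [[i, i * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
print(Counter(y_over))
```

Every added row is an exact copy of an existing minority row, so no new information enters the dataset.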

Undersampling

Reduce the number of majority class samples.
Example: Randomly remove samples from majority class.
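Random removal can be sketched the same way. The helper name `random_undersample` and the toy data are assumptions for illustration; imbalanced-learn's `RandomUnderSampler` is the usual library route.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw without replacement so no row is kept twice.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (hypothetical): 8 majority rows, 2 minority rows.
X = [[i, i * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
print(Counter(y_under))
```

Here 6 of the 8 majority rows are discarded, which shows how much data undersampling can throw away.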

SMOTE (Synthetic Minority Over-sampling Technique)

Generates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors.
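The core idea can be illustrated in plain Python: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. This is a simplified sketch, not the imbalanced-learn implementation; the function name `smote_sketch`, the toy points, and `k=2` (chosen because the toy set is tiny) are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base, excluding base itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority-class points (hypothetical values).
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0]]

new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each synthetic point lies between two real minority points, it stays inside the region the minority class already occupies, unlike a random duplicate, which sits exactly on top of an existing point.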

🔹 Example (Using SMOTE)

from imblearn.over_sampling import SMOTE

# X_train, y_train: training features and labels prepared earlier
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

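X_res and y_res → Resampled training data; with the default settings the minority class is oversampled until it matches the majority class.
Intended for imbalanced classification problems; resample only the training split so that evaluation still reflects the real class distribution.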

🔹 Advantages

Reduces the model's bias toward the majority class.
SMOTE generates new synthetic examples instead of duplicating existing ones.

🔹 Disadvantages

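Oversampling → Can overfit, because the model sees the same minority samples repeatedly.
Undersampling → May discard majority samples that carry important information.
SMOTE → May create noisy samples where classes overlap or around minority outliers.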

✅ Quick Recap:

Imbalanced data → Classes not equally represented.
Handle it with Oversampling, Undersampling, or SMOTE so the model is not dominated by the majority class, and judge the result with precision, recall, and F1 rather than accuracy alone.


Copyright © 2026 Raman Digital Institute
