Lesson 3.3: Handling Missing Data – Mean/Median, Interpolation, Dropping
In real-world datasets, missing values are common. Left unhandled, they can distort analysis and hurt model performance.
1. Why Does Missing Data Occur?
- Human error in data entry.
- Sensor or device malfunction.
- Data not recorded for some categories.
- Errors during data transfer.
2. Techniques to Handle Missing Data
(a) Replacing with Mean/Median/Mode
- Mean → Works well for numeric data without extreme values (e.g., replace missing exam scores with the average).
- Median → Better when the data has outliers (e.g., salaries where some values are extremely high).
- Mode → Good for categorical data (e.g., a missing “Gender” value replaced with the most common value).
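The three replacements above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a prescribed implementation: the helper name `fill_missing` and the sample lists are hypothetical, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy):
    """Return a copy of `values` with None replaced by the mean, median, or mode."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample data; None marks a missing entry.
exam_scores = [70, 80, None, 90]
salaries = [30_000, 32_000, None, 35_000, 1_000_000]
genders = ["F", "M", "F", None, "F"]

filled_scores = fill_missing(exam_scores, "mean")    # gap becomes 80
filled_salaries = fill_missing(salaries, "median")   # gap becomes 33500.0
filled_genders = fill_missing(genders, "mode")       # gap becomes "F"
```

In the salary list, the mean of the observed values is 274,250 because of the single very high salary, while the median is 33,500, which is much closer to a typical value.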
(b) Interpolation
- Estimates missing values from trends or neighboring values.
- Example: If temperature data is missing for one day, you can use the average of the day before and the day after (linear interpolation).
- Useful in time-series datasets (stock prices, weather data).
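The temperature example can be generalized to gaps of any length with linear interpolation. The sketch below assumes evenly spaced readings and uses `None` for missing entries; `interpolate_linear` and the sample readings are hypothetical names and values chosen for illustration.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbors.

    Leading or trailing gaps have only one known neighbor, so they stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two known points, assuming even spacing.
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical daily temperatures with a one-day and a two-day gap.
daily_temps = [20.0, None, 24.0, None, None, 30.0]
filled_temps = interpolate_linear(daily_temps)
print(filled_temps)  # [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
```

For a single missing day this reduces to the average of the day before and the day after, as in the example above.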
(c) Dropping Missing Data
- If too many values are missing, you can remove the affected rows or columns.
- Example: If 90% of a column is empty, it’s usually better to drop the column.
- But be careful: dropping too much data shrinks the dataset and may reduce accuracy.
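A minimal sketch of both kinds of dropping, assuming the table is a list of dictionaries with `None` for missing cells. The function names, the sample records, and the 50% default threshold are assumptions made for illustration.

```python
def missing_fraction(rows, column):
    """Share of rows in which `column` is None."""
    return sum(row[column] is None for row in rows) / len(rows)

def drop_sparse_columns(rows, max_missing=0.5):
    """Remove every column whose missing share exceeds `max_missing`."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= max_missing]
    return [{c: row[c] for c in keep} for row in rows]

def drop_incomplete_rows(rows):
    """Remove every row that still contains a None."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical records: "fax" is 75% missing, "age" is 25% missing.
records = [
    {"age": 25, "city": "Pune", "fax": None},
    {"age": None, "city": "Delhi", "fax": None},
    {"age": 31, "city": "Mumbai", "fax": None},
    {"age": 40, "city": "Chennai", "fax": "555-0100"},
]

without_fax = drop_sparse_columns(records)     # drops the "fax" column
complete = drop_incomplete_rows(without_fax)   # drops the row with no age
```

The order matters here: dropping the sparse column first keeps three rows, whereas dropping incomplete rows first would leave only one.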
(d) Advanced Methods
- KNN Imputer → Estimates each missing value from the most similar data points (its nearest neighbors).
- Regression/ML models → Predict missing values from the other variables.
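To make the nearest-neighbor idea concrete, here is a simplified sketch in plain Python. It is not the implementation used by any particular library: it assumes that only one column (`target`) has gaps, that all other columns are numeric and complete, and that unscaled Euclidean distance is a reasonable measure of similarity. The function names and the height/weight rows are hypothetical.

```python
import math

def _features(row, target):
    """All values of a row except the column being imputed."""
    return [v for i, v in enumerate(row) if i != target]

def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column
    over the k most similar complete rows."""
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            # Rank complete rows by distance on the remaining columns.
            neighbors = sorted(
                complete,
                key=lambda other: math.dist(
                    _features(row, target), _features(other, target)
                ),
            )[:k]
            filled[target] = sum(n[target] for n in neighbors) / len(neighbors)
        result.append(filled)
    return result

# Hypothetical [height_cm, weight_kg] rows; the last weight is missing.
people = [[160, 55], [162, 57], [180, 80], [182, 82], [161, None]]
imputed = knn_impute(people, target=1, k=2)
print(imputed[-1])  # [161, 56.0]
```

The missing weight is filled from the two people closest in height (160 cm and 162 cm) rather than from the overall average, which the two taller people would pull upward.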
3. Best Practice
- Always measure how much data is missing before choosing a method.
- As a rule of thumb, if less than 5% is missing → fill it.
- If more than 50% is missing → consider dropping.
- Document how you handled missing data for transparency.
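The thresholds above can be turned into a quick per-column check. In this sketch the 5% and 50% cut-offs come from the list above, while the "review manually" label for everything in between is an assumption, since the lesson gives no rule for that range. All names and sample columns are hypothetical.

```python
def missing_fraction(values):
    """Share of entries in a column that are None."""
    return sum(v is None for v in values) / len(values)

def suggest_action(fraction, fill_below=0.05, drop_above=0.50):
    """Map a missing share to the rule of thumb from this lesson."""
    if fraction == 0:
        return "nothing to do"
    if fraction < fill_below:
        return "fill"
    if fraction > drop_above:
        return "consider dropping"
    return "review manually"  # assumed label for the 5%-50% range

# Hypothetical columns with 40 entries each.
columns = {
    "age": [30] * 40,                            # 0% missing
    "income": [None] + [50_000] * 39,            # 2.5% missing
    "city": [None] * 8 + ["Pune"] * 32,          # 20% missing
    "fax": [None] * 30 + ["555-0100"] * 10,      # 75% missing
}

report = {name: suggest_action(missing_fraction(vals))
          for name, vals in columns.items()}
for name, action in report.items():
    print(f"{name}: {action}")
```

Saving a report like this alongside the cleaned dataset is one simple way to document how missing data was handled.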
✅ Summary:
- Mean/Median/Mode → Simple replacements.
- Interpolation → Estimate using trends.
- Dropping → Remove missing-heavy rows/columns.
- Advanced ML methods → For complex cases.
