Lesson 3.4: Handling Outliers – IQR, Z-Score
Outliers are data points that differ significantly from the rest of the dataset. They can distort statistical analysis and affect the performance of machine learning models, so detecting and handling them is important.
🔹 What are Outliers?
- Outliers are extreme values that don’t follow the general trend of the data.
- Example: In a dataset of people’s heights (mostly between 150–190 cm), a value of 250 cm would be an outlier.
🔹 Methods to Detect and Handle Outliers
- Interquartile Range (IQR) Method
  - IQR = Q3 − Q1 (the difference between the 75th percentile and the 25th percentile).
  - Outliers are values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
  - Action: Remove or cap these values.
Example in Python:
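A minimal sketch of the IQR rule, assuming a small hypothetical list of heights in cm (the values and variable names are illustrative, not taken from a real dataset). It uses only Python's standard library; `statistics.quantiles` with `method="inclusive"` treats the data as the full set of observations when computing the quartiles.

```python
import statistics

# Hypothetical sample: heights in cm, with one implausible value (250).
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 250]

# Q1 and Q3 are the first and third of the three quartile cut points.
q1, _median, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < lower_fence or h > upper_fence]

# Action 1: remove the flagged values.
removed = [h for h in heights if lower_fence <= h <= upper_fence]

# Action 2: cap the flagged values at the nearest fence.
capped = [min(max(h, lower_fence), upper_fence) for h in heights]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("After removal:", removed)
print("After capping:", capped)
```

Removing shortens the dataset, while capping keeps every row but pulls the extreme value back to the fence; which is appropriate depends on whether the value is an error or a genuine observation.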
- Z-Score Method
  - A Z-score measures how many standard deviations a value is from the mean.
  - Formula:
    $$Z = \frac{x - \mu}{\sigma}$$
    where $\mu$ is the mean and $\sigma$ is the standard deviation.
  - By convention, a point with |Z| > 3 is treated as an outlier.
Example in Python:
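A minimal sketch of the Z-score rule, assuming a hypothetical sample of 20 heights in cm (illustrative values only). It uses the population standard deviation (`statistics.pstdev`) to match σ in the formula; using the sample standard deviation instead is an equally common choice and gives slightly smaller scores.

```python
import statistics

# Hypothetical sample: 19 plausible heights in cm plus one extreme value (250).
heights = [150, 152, 155, 158, 160, 162, 164, 165, 167, 168,
           170, 172, 174, 175, 177, 180, 183, 186, 190, 250]

mean = statistics.mean(heights)
std = statistics.pstdev(heights)  # population standard deviation

# Z-score of each value: distance from the mean in units of std.
z_scores = [(h - mean) / std for h in heights]

threshold = 3  # the |Z| > 3 rule of thumb
outliers = [h for h, z in zip(heights, z_scores) if abs(z) > threshold]
cleaned = [h for h, z in zip(heights, z_scores) if abs(z) <= threshold]

print("Mean:", round(mean, 2), "Std:", round(std, 2))
print("Outliers:", outliers)
```

The extreme value itself inflates both the mean and the standard deviation, so in very small samples a point may never reach |Z| > 3 even when it is clearly unusual.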
🔹 How to Handle Outliers?
- Remove them if they are errors or irrelevant.
- Cap/limit values to reduce their effect.
- Transform data (e.g., log transformation) to reduce skewness.
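The three options above can be sketched side by side. This is an illustration under stated assumptions: the `handle_outliers` helper, the sample values, and the limits 0 and 60 are all hypothetical; in practice the limits might come from the IQR fences or from chosen percentiles.

```python
import math


def handle_outliers(values, lower, upper, strategy):
    """Hypothetical helper: apply one outlier-handling strategy to a list."""
    if strategy == "remove":
        # Drop values outside [lower, upper].
        return [v for v in values if lower <= v <= upper]
    if strategy == "cap":
        # Pull values outside [lower, upper] back to the nearest limit.
        return [min(max(v, lower), upper) for v in values]
    if strategy == "log":
        # log(1 + v) compresses large values; requires v > -1.
        return [math.log1p(v) for v in values]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical right-skewed data with one very large value.
values = [20, 22, 25, 27, 30, 32, 35, 40, 45, 400]

removed = handle_outliers(values, 0, 60, "remove")
capped = handle_outliers(values, 0, 60, "cap")
logged = handle_outliers(values, 0, 60, "log")

print("Removed:", removed)
print("Capped:", capped)
print("Log-transformed:", [round(v, 2) for v in logged])
```

Removal changes the number of rows, capping changes only the extreme values, and the log transform changes every value but preserves their order.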
✅ In summary:
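- Outliers can distort summary statistics and reduce the accuracy of data analysis and models.
- The IQR method does not assume a particular distribution, so it suits skewed data.
- The Z-score method works best when the data are approximately normally distributed.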
