Lesson 3.5: Data Encoding – One Hot Encoding, Label Encoding
Introduction
Most machine learning models work with numerical data. However, in real-world datasets, we often encounter categorical data (e.g., gender = Male/Female, city = Delhi/Mumbai/Kolkata). To use such data in models, we need to convert categorical values into numerical form without losing information. This process is called Data Encoding.
Two commonly used methods are:
- Label Encoding
- One Hot Encoding
1. Label Encoding
- Assigns a unique integer value to each category.
- Example:

| City | Encoded |
|---|---|
| Delhi | 0 |
| Mumbai | 1 |
| Kolkata | 2 |
🔹 Pros: Simple, memory efficient.
🔹 Cons: May create false ordinal relationships. The integers imply an order that does not exist in the data (e.g., a model may treat Mumbai = 1 as greater than Delhi = 0).
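A minimal sketch of label encoding in plain Python, assuming codes are assigned in order of first appearance so that the result matches the example above. The sample list and the names `cities`, `label_map`, and `encoded` are made up for illustration.

```python
# Sample data; the city names come from the lesson's example.
cities = ["Delhi", "Mumbai", "Kolkata", "Mumbai", "Delhi"]

# Build the category -> integer mapping in order of first appearance.
# (label_map is a hypothetical name used only in this sketch.)
label_map = {}
for city in cities:
    if city not in label_map:
        label_map[city] = len(label_map)

# Replace every value with its integer code.
encoded = [label_map[city] for city in cities]

print(label_map)  # {'Delhi': 0, 'Mumbai': 1, 'Kolkata': 2}
print(encoded)    # [0, 1, 2, 1, 0]
```

In practice a library class such as scikit-learn's `LabelEncoder` is commonly used for this. Note that it assigns codes in sorted order of the labels, so the same data would give Delhi = 0, Kolkata = 1, Mumbai = 2 rather than the mapping shown here.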
2. One Hot Encoding
- Creates a binary column (0/1) for each category.
- Example:

| City | Delhi | Mumbai | Kolkata |
|---|---|---|---|
| Delhi | 1 | 0 | 0 |
| Mumbai | 0 | 1 | 0 |
| Kolkata | 0 | 0 | 1 |
🔹 Pros: No false order, so it suits categories that have no natural ranking.
🔹 Cons: Adds one column per category, so the dataset grows quickly when a feature has many distinct values.
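A minimal sketch of one hot encoding in plain Python, assuming one column per distinct category in order of first appearance. The sample list and the names `cities`, `categories`, and `one_hot` are made up for illustration.

```python
# Sample data; the city names come from the lesson's example.
cities = ["Delhi", "Mumbai", "Kolkata", "Mumbai"]

# Distinct categories in order of first appearance; each becomes one column.
categories = list(dict.fromkeys(cities))

# For each row, put 1 in the column of its own category and 0 elsewhere.
one_hot = [
    [1 if city == category else 0 for category in categories]
    for city in cities
]

print(categories)  # ['Delhi', 'Mumbai', 'Kolkata']
for city, row in zip(cities, one_hot):
    print(city, row)
# Delhi [1, 0, 0]
# Mumbai [0, 1, 0]
# Kolkata [0, 0, 1]
# Mumbai [0, 1, 0]
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why the encoded data widens as categories are added. On real datasets this is usually done with a library helper such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than by hand.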
When to Use?
- Label Encoding → Good for ordinal data (e.g., education level: High School < Graduate < Postgraduate), provided the integers are assigned in the same order as the categories.
- One Hot Encoding → Better for nominal data (no order, e.g., city names, colors).
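For ordinal data the integer codes are only useful if they follow the real order of the categories. A minimal sketch, assuming the order is written out by hand (the names `education_order`, `rank`, and `levels` are hypothetical and the sample values are made up):

```python
# Explicit order from the lesson: High School < Graduate < Postgraduate.
education_order = ["High School", "Graduate", "Postgraduate"]
rank = {level: i for i, level in enumerate(education_order)}

levels = ["Graduate", "High School", "Postgraduate", "Graduate"]
encoded = [rank[level] for level in levels]
print(encoded)  # [1, 0, 2, 1]

# Assigning codes alphabetically instead would break the order:
# Graduate would get a smaller code than High School.
alphabetical = {level: i for i, level in enumerate(sorted(education_order))}
print(alphabetical)  # {'Graduate': 0, 'High School': 1, 'Postgraduate': 2}
```

This is worth checking whenever an encoder chooses the codes automatically, since a sorted or first-seen order rarely matches the intended ranking.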
✅ Summary:
- Data encoding converts categorical values into numerical format.
- Label Encoding replaces categories with numbers.
- One Hot Encoding creates separate binary columns for each category.
- Choice depends on whether the data is ordinal or nominal.
