Lesson 3.7: Splitting Data – Train/Test Split, Cross Validation
🔹 Train/Test Split
When building a Machine Learning model, we need to evaluate its performance on new, unseen data.
For this, the dataset is divided into:
- Training Set → Used to train the model.
- Testing Set → Used to evaluate how well the model performs.
👉 Example using Scikit-learn:
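A minimal sketch of an 80/20 split using scikit-learn's `train_test_split` (the data values here are illustrative placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the data for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The model is then fit on `X_train`/`y_train` only, and its score on `X_test` estimates performance on unseen data.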
✅ Evaluating on held-out data shows whether the model has learned general patterns rather than simply memorizing the training data.
🔹 Cross Validation
Cross Validation (CV) gives a more reliable evaluation than a single train/test split.
- The dataset is split into k parts (folds).
- The model is trained on k−1 folds and tested on the remaining fold.
- This process is repeated k times, and the results are averaged.
👉 Example using K-Fold Cross Validation:
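A minimal sketch of 5-fold cross validation with scikit-learn's `KFold` and `cross_val_score` (the model and dataset here are illustrative choices, not the only option):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: train on 4 folds, test on the remaining one, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged result across all folds
```

Each fold serves as the test set exactly once, so every sample is used for both training and evaluation.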
✅ CV helps ensure the model works well on different subsets of the data.
📌 Summary
- Train/Test Split → Quick evaluation (e.g., 80% train, 20% test).
- Cross Validation → More reliable evaluation; reduces the risk of a misleading score from a single lucky or unlucky split.
- Both techniques are essential in Machine Learning pipelines.
