Lesson 3.7: Splitting Data – Train/Test Split, Cross Validation
🔹 Train/Test Split
When building a Machine Learning model, we need to evaluate its performance on new, unseen data.
For this, the dataset is divided into:
- Training Set → Used to train the model.
- Testing Set → Used to evaluate how well the model performs.
👉 Example using Scikit-learn:
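A minimal sketch of an 80/20 split using scikit-learn's `train_test_split` (the data values here are illustrative placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the data for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The model is then fit on `X_train`/`y_train` only, and its score on `X_test` estimates performance on unseen data.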
✅ Evaluating on held-out data shows whether the model has learned general patterns rather than simply memorizing the training data.
🔹 Cross Validation
Cross Validation (CV) gives a more reliable evaluation than a single train/test split.
- The dataset is split into k parts (folds).
- The model is trained on k−1 folds and tested on the remaining fold.
- This process is repeated k times, and the results are averaged.
👉 Example using K-Fold Cross Validation:
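A minimal sketch of 5-fold cross validation with scikit-learn's `KFold` and `cross_val_score` (the model and dataset here are illustrative choices, not the only option):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5 folds: train on 4 folds, test on the remaining one, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged result across all folds
```

Each fold serves as the test set exactly once, so every sample is used for both training and evaluation.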
✅ CV helps ensure the model works well on different subsets of the data.
📌 Summary
- Train/Test Split → Quick evaluation (e.g., 80% train, 20% test).
- Cross Validation → More reliable evaluation; reduces the risk of a misleading score from a single lucky or unlucky split.
- Both techniques are essential in Machine Learning pipelines.
