The validation set is a portion of the training data set aside for hyperparameter tuning and model selection. It’s distinct from both the Training set (which fits parameters) and the Test set (which estimates final generalization).
The three-way split:
- Training set: fit the model’s parameters, for example via Gradient descent.
- Validation set: compare model configurations and pick hyperparameters such as learning rate, polynomial degree, or regularization strength.
- Test set: locked away until the end. Used once, for the final unbiased estimate of generalization performance.
Why the separate validation set: if we tuned hyperparameters against the test set, we’d implicitly be optimizing against it. After enough tuning rounds, the test set would no longer measure generalization to unseen data. It would measure how well we tuned to that particular test set. So the test set has to be locked away.
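The effect can be seen in a small simulation. This is only an illustrative sketch under artificial assumptions: the labels are coin flips, so no classifier can truly beat 50% accuracy, and the "configurations" are random linear classifiers (a stand-in for real tuning rounds). Choosing the candidate that scores best on the test set still produces a test score above chance, while the same candidate falls back to roughly chance on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Labels are coin flips, independent of the features,
# so the true accuracy of any classifier is 0.5.
X_test = rng.normal(size=(100, n_features))
y_test = rng.integers(0, 2, size=100)
X_fresh = rng.normal(size=(10_000, n_features))
y_fresh = rng.integers(0, 2, size=10_000)


def accuracy(w, X, y):
    """Accuracy of the linear rule 'predict 1 when X @ w > 0'."""
    return float(np.mean((X @ w > 0).astype(int) == y))


# 200 hypothetical "configurations": random weight vectors.
candidates = [rng.normal(size=n_features) for _ in range(200)]

# Selecting against the test set is the mistake being illustrated.
test_scores = [accuracy(w, X_test, y_test) for w in candidates]
best_index = int(np.argmax(test_scores))
best_w = candidates[best_index]

best_test_acc = test_scores[best_index]          # optimistic
fresh_acc = accuracy(best_w, X_fresh, y_fresh)   # close to chance

print(f"accuracy on the test set used for selection: {best_test_acc:.3f}")
print(f"accuracy on fresh data: {fresh_acc:.3f}")
```

The gap between the two printed numbers is the optimism introduced purely by selecting against the test set.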
A typical workflow:
- Split the data: 20% test, 80% train+validation.
- Within the train+validation portion, split again: 75% train, 25% validation (0.75 × 80% = 60% and 0.25 × 80% = 20% of the whole dataset).
- Train on the training portion. Evaluate on validation. Try a different hyperparameter. Train again. Evaluate again. Iterate.
- Pick the configuration with the best validation performance.
- Retrain on all of train+validation using that configuration.
- Evaluate once on the test set. Report that number.
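The workflow above can be sketched in scikit-learn roughly as follows. The synthetic dataset, the choice of LogisticRegression, and the candidate values of C are all assumptions made for illustration. Any estimator and hyperparameter grid would follow the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1: peel off 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: split the remaining 80% into 75% train / 25% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval
)

# Steps 3-4: train one model per candidate value, score each on validation,
# and keep the value with the best validation accuracy.
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid
val_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)
best_C = max(val_scores, key=val_scores.get)

# Step 5: retrain on all of train+validation with the chosen configuration.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X_trainval, y_trainval)

# Step 6: evaluate once on the test set and report that number.
test_accuracy = final_model.score(X_test, y_test)
print(f"best C = {best_C}, test accuracy = {test_accuracy:.3f}")
```

The test set is touched exactly once, on the last step. Every decision before that uses only the training and validation portions.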
For small datasets where a separate validation set would shrink the training set too much, K-fold cross-validation is the standard alternative. It splits the train+validation portion into K folds and rotates which fold is used for validation, training K models in total, each on the remaining K-1 folds. The average validation score across folds is a lower-variance estimate than the score from a single train/validate split.
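A minimal sketch of that rotation with scikit-learn's KFold and cross_val_score, assuming a small synthetic dataset in place of the train+validation portion and K = 5:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in for the train+validation portion (assumption for illustration).
X_trainval, y_trainval = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# 5 folds: each fold serves as the validation set exactly once,
# so 5 models are trained in total.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_trainval, y_trainval, cv=kfold
)

mean_score = fold_scores.mean()
print("per-fold accuracy:", fold_scores.round(3))
print(f"mean validation accuracy: {mean_score:.3f}")
```

The spread of the per-fold scores also gives a rough sense of how noisy a single train/validate split would have been.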
In scikit-learn, you can either do a manual train/validation split with train_test_split (called twice: once to peel off the test set, once to split the rest) or use cross-validation helpers like cross_val_score and GridSearchCV, which handle the validation splits internally. Either way, the test set still has to be split off by hand before any tuning starts.
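As a sketch of the GridSearchCV route, assuming synthetic data and a hypothetical grid over LogisticRegression's C: the test set is split off first with train_test_split, and GridSearchCV then runs 5-fold cross-validation on the remainder. With its default refit=True, it retrains the best configuration on the full train+validation portion.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hold out the test set before any tuning happens.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# GridSearchCV creates the validation folds internally (cv=5) and,
# by default, refits the best configuration on all of X_trainval.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
)
search.fit(X_trainval, y_trainval)

best_params = search.best_params_
cv_score = search.best_score_  # mean validation accuracy across folds
test_accuracy = search.score(X_test, y_test)  # single final evaluation

print(f"best params: {best_params}")
print(f"cross-validated accuracy: {cv_score:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```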
The validation/test distinction is what keeps the final evaluation honest. Without it, the test number is contaminated by all the hyperparameter tuning that came before it.