The validation set is a portion of the data held out from training and used for hyperparameter tuning and model selection. It is distinct from both the Training set (which fits parameters) and the Test set (which estimates final generalization).
The three-way split:
- Training set — fit the model’s parameters via Gradient descent.
- Validation set — evaluate different model configurations, pick hyperparameters like learning rate, polynomial degree, regularization strength.
- Test set — locked away until the end. Used once, for the final unbiased estimate of generalization performance.
The reason for the separate validation set: if we tuned hyperparameters against the test set, we’d implicitly be optimizing against it. After enough tuning rounds, the test set would no longer measure generalization to unseen data; it would measure how well we tuned to that particular test set.
A typical workflow:
- Split the data: 20% test, 80% train+validation.
- Within the train+validation portion, split again: 75% train, 25% validation (= 60% / 20% of the whole).
- Train on the training portion. Evaluate on validation. Try a different hyperparameter. Train again. Evaluate again. Iterate.
- Pick the configuration with the best validation performance.
- Retrain on all of train+validation using that configuration.
- Evaluate once on the test set. Report that number.
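The workflow above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not a prescribed recipe: the synthetic dataset from make_regression, the choice of Ridge as the model, the candidate alpha values, and mean squared error as the metric are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Step 1: peel off 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: 25% of the remaining 80% becomes validation (20% of the whole).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Steps 3-4: train one model per candidate regularization strength and
# record its error on the validation set.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
val_errors = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

# Keep the configuration with the lowest validation error.
best_alpha = min(val_errors, key=val_errors.get)

# Step 5: retrain on all of train+validation with the chosen configuration.
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)

# Step 6: touch the test set exactly once and report that number.
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.2f}")
```

With 500 samples this yields 300 training, 100 validation, and 100 test rows, matching the 60% / 20% / 20% proportions.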
For small datasets where a separate validation set would shrink the training set too much, K-fold cross-validation is the standard alternative. It splits the train+validation portion into K folds and rotates which fold is used for validation, training K models in total (each on the other K-1 folds). The average validation score across folds is a more robust estimator than a single train/validate split.
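To make the fold rotation concrete, here is a hand-rolled sketch using only NumPy. The helper names ridge_fit and kfold_mse, the closed-form ridge model, and the synthetic data are all hypothetical choices for illustration; in practice a library helper would normally do this.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def kfold_mse(X, y, alpha, k=5, seed=0):
    """Return (mean validation MSE, per-fold MSEs) for one alpha value."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))      # shuffle before splitting
    folds = np.array_split(indices, k)     # k roughly equal index groups
    scores = []
    for i in range(k):
        val_idx = folds[i]                 # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        residuals = X[val_idx] @ w - y[val_idx]
        scores.append(float(np.mean(residuals ** 2)))
    return float(np.mean(scores)), scores

# Synthetic train+validation data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

mean_mse, fold_mses = kfold_mse(X, y, alpha=1.0, k=5)
print(f"per-fold MSE: {fold_mses}")
print(f"mean validation MSE: {mean_mse:.4f}")
```

Each sample is used for validation exactly once and for training K-1 times, which is why the averaged score depends less on one lucky or unlucky split.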
In scikit-learn, you can either do a manual train/validation split with train_test_split (called twice: once to peel off the test set, once to split the rest) or use cross-validation helpers like cross_val_score and GridSearchCV, which handle the splits internally.
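A short sketch of the cross-validation helpers, under the same illustrative assumptions as before (synthetic data, Ridge as the model, a hypothetical alpha grid, negative mean squared error as the score). The test set is still peeled off first, so the helpers only ever see the train+validation portion.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# cross_val_score: score ONE configuration with 5-fold cross-validation.
# It returns one score per fold; scikit-learn scores are "higher is better",
# hence the negated MSE.
scores = cross_val_score(
    Ridge(alpha=1.0), X_trainval, y_trainval,
    cv=5, scoring="neg_mean_squared_error",
)
print(f"mean CV score for alpha=1.0: {scores.mean():.2f}")

# GridSearchCV: cross-validate EVERY configuration in the grid and keep the best.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_trainval, y_trainval)

# With the default refit=True, the best configuration has already been
# retrained on all of X_trainval, so one final test evaluation remains.
test_score = search.score(X_test, y_test)
print(f"best params: {search.best_params_}, test score: {test_score:.2f}")
```

Because refit defaults to True, GridSearchCV covers the "retrain on all of train+validation" step on its own; only the single test-set evaluation is left to do by hand.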
The discipline matters because the validation/test distinction is what keeps the final evaluation honest. Without it, the test number is contaminated by all the hyperparameter tuning that came before it.