The validation set is a portion of the available data held out from training and used for hyperparameter tuning and model selection. It is distinct from both the Training set (which fits parameters) and the Test set (which estimates final generalization).

The three-way split:

  • Training set: fit the model’s parameters, for example via Gradient descent.
  • Validation set: evaluate different model configurations and pick hyperparameters such as learning rate, polynomial degree, or regularization strength.
  • Test set: locked away until the end, then used once for the final unbiased estimate of generalization performance.

The reason for the separate validation set: if we tuned hyperparameters against the test set, we’d implicitly be optimizing against it. After enough tuning rounds, the test score would no longer measure generalization to unseen data; it would measure how well we had tuned to that particular test set. That is why the test set stays locked away.

A typical workflow:

  1. Split the data: 20% test, 80% train+validation.
  2. Within the train+validation portion, split again: 75% train, 25% validation (that is, 60% and 20% of the whole dataset).
  3. Train on the training portion. Evaluate on validation. Try a different hyperparameter. Train again. Evaluate again. Iterate.
  4. Pick the configuration with the best validation performance.
  5. Retrain on all of train+validation using that configuration.
  6. Evaluate once on the test set. Report that number.
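
The six steps above can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not a prescribed recipe: the synthetic data from make_regression, the Ridge model, and the list of candidate alpha values are all assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data, used purely for illustration (an assumption).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Step 1: peel off 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which is 60% / 20% of the whole dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Steps 3-4: train one model per candidate hyperparameter value and
# keep the value with the best validation score (R^2 for Ridge.score).
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)
best_alpha = max(val_scores, key=val_scores.get)

# Step 5: retrain on all of train+validation with the chosen configuration.
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)

# Step 6: evaluate once on the test set and report that number.
test_r2 = final_model.score(X_test, y_test)
print(f"best alpha={best_alpha}, test R^2={test_r2:.3f}")
```

Note that the test arrays are touched exactly once, on the last line of the computation; everything that influences the choice of alpha sees only the training and validation portions.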

For small datasets where a separate validation set would shrink the training set too much, K-fold cross-validation is the standard alternative. It splits the train+validation portion into K folds and rotates which one is used for validation, training K models in total, each on the remaining K-1 folds. The average validation score across folds is a more robust estimate than the score from a single train/validation split.
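
As a rough sketch of the fold rotation, the snippet below scores a few candidate settings with 5-fold cross-validation and compares their mean scores. The data, the Ridge model, the alpha values, and the choice of K=5 are illustrative assumptions; the array here stands in for the train+validation portion, with the test set assumed to have been split off already.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Stand-in for the train+validation portion (synthetic, an assumption).
X_trainval, y_trainval = make_regression(
    n_samples=100, n_features=10, noise=10.0, random_state=0
)

# K=5 folds: each fold serves as the validation set exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for alpha in [0.1, 1.0, 10.0]:  # hypothetical candidate values
    # cross_val_score fits 5 models and returns one score per fold.
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_trainval, y_trainval, cv=cv)
    mean_scores[alpha] = fold_scores.mean()

# Pick the setting with the best average validation score.
best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_alpha)
```

Reusing the same KFold object for every candidate means all settings are compared on identical folds, so differences in the mean score are not caused by different splits.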

In scikit-learn, you can either do a manual train/validation split with train_test_split (called twice: once to peel off the test set, once to split the rest) or use cross-validation helpers like cross_val_score and GridSearchCV, which handle the splits internally.
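
For the GridSearchCV route, a minimal sketch might look like the following. The test set is still peeled off by hand with train_test_split; GridSearchCV then handles the validation folds internally. The dataset, the Ridge estimator, and the parameter grid are assumptions made only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data for illustration only (an assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# The test set is still held out manually, before any tuning happens.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# GridSearchCV runs 5-fold cross-validation for every grid entry.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_trainval, y_trainval)

# With the default refit=True, the best configuration has already been
# retrained on all of X_trainval, so it can be scored once on the test set.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, f"test R^2={test_r2:.3f}")
```

Because refit defaults to True, the "retrain on all of train+validation" step happens inside fit, and search.best_estimator_ holds the retrained model.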

The discipline matters because the validation/test distinction is what keeps the final evaluation honest. Without it, the test number is contaminated by all the hyperparameter tuning that came before it.