Training set

The training set is the portion of the data used to fit the parameters of a model. The model sees the training examples, computes a loss, and uses an optimizer such as Gradient descent to adjust its parameters toward smaller loss on this data.

Minimizing the training loss rewards memorizing the training set, so a model’s performance on that set is an optimistic estimate and tells us very little about how it’ll behave on data it hasn’t seen. The training set is for learning, not for evaluation.
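
To make the memorization point concrete, here is a minimal sketch using a 1-nearest-neighbor classifier written in plain Python. This is a swapped-in technique chosen because it memorizes by construction; it does not use gradient descent. The data generator, the 20% label noise, and all function names are hypothetical.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Hypothetical 1-D data: the label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label  # label noise the model cannot truly learn
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data whose label the 1-NN model predicts."""
    correct = sum(predict_1nn(train, x) == label for x, label in data)
    return correct / len(data)

train = make_data(200)
test = make_data(200)

train_acc = accuracy(train, train)  # every point is its own nearest neighbor
test_acc = accuracy(train, test)    # unseen points expose the memorized noise
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training accuracy is perfect because each training point finds itself, noise included, while the accuracy on fresh points is lower. Only the second number says anything about generalization.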

For evaluation, set aside a separate Test set before training begins, never let the model see it during training, and only use it at the end to estimate generalization performance. A typical split is 70-80% for training and 20-30% for testing.

Often we go further and subdivide the training set into:

A training portion used to fit the model parameters by gradient descent.
A validation portion used to compare different model configurations and pick hyperparameters (learning rate, polynomial degree, regularization strength).

So the full arrangement is train / validate / test: train fits the parameters, validate tunes the hyperparameters, test estimates generalization. We don’t tune anything after seeing the test set; that would contaminate the evaluation.

In scikit-learn, train_test_split separates the data:

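from sklearn.model_selection import train_test_split

# X holds the features and y the labels, both assumed to be loaded already.
# test_size=0.3 keeps 30% for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)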
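
The train / validate / test arrangement described earlier can be sketched by calling train_test_split twice. This is one possible approach, not the only one; the toy data and the 60/20/20 proportions are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with one feature and a binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First split: hold out 20% as the test set and do not touch it again.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Second split: 25% of the remaining 80% gives a 20% validation portion.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, shuffle=True, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Splitting off the test set first means that nothing done later with the validation portion can touch it.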

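In cross-validation, the train/validate split is repeated several times: the data is divided into k folds, each fold serves once as the validation portion while the model is fit on the remaining folds, and the validation scores are averaged. This gives a steadier estimate when the dataset is small.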
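
The fold mechanics can be sketched in plain Python. The helper name k_fold_indices and its parameters are hypothetical; in practice scikit-learn's KFold and cross_val_score cover this, and the sketch only shows what happens underneath.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    # Fit on train_idx and score on val_idx here, then average the k scores.
    print(len(train_idx), len(val_idx))
```

With 10 samples and 3 folds the validation portions have sizes 4, 3, and 3, and together they cover every sample once. The test set stays outside this loop entirely.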

The discipline around training and test data is what keeps machine-learning evaluation honest. The common ways to break it (fitting preprocessing steps such as scaling on the entire dataset before splitting, peeking at the test set during development, repeatedly tuning against the test set) all amount to Data leakage and produce models that look good in evaluation and fail in deployment.
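
As an illustration of the preprocessing case, here is a minimal sketch of standardization done the leak-free way, next to the leaky statistics for comparison. The values and helper names are hypothetical; in scikit-learn the same idea is to call fit on the training data only (for example with StandardScaler) and then transform both portions.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]  # hypothetical held-out values

# Correct: the statistics come from the training portion alone.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky: statistics computed on train + test let test values shape preprocessing.
leaky_mean, leaky_std = fit_standardizer(train + test)

print(mean, leaky_mean)
```

The two means differ, which shows that the leaky version has already absorbed information from the held-out values before any model is fit.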

Idriss Rami — Notes
