The training set is the portion of the data used to fit the parameters of a model. The model sees the training examples, computes a loss, and uses an optimizer such as gradient descent to adjust its parameters toward a smaller loss on this data.
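As a rough illustration of that loop, here is a minimal NumPy sketch that fits a one-feature linear model by gradient descent on mean squared error. The synthetic data, the learning rate of 0.1, and the 200 steps are all assumptions chosen for the example, not recommended settings.

```python
import numpy as np

# Synthetic training data (an assumption for illustration): y is about 3x + 1 plus noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 1))
y_train = 3.0 * X_train[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0          # model parameters, starting from zero
learning_rate = 0.1      # illustrative value
losses = []

for step in range(200):
    pred = w * X_train[:, 0] + b           # the model sees the training examples
    error = pred - y_train
    losses.append(np.mean(error ** 2))     # mean squared error on the training set
    grad_w = 2.0 * np.mean(error * X_train[:, 0])
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w            # step toward smaller training loss
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, first loss={losses[0]:.3f}, last loss={losses[-1]:.3f}")
```

Everything in this loop refers to the training data only; nothing in it measures how the fitted line does on new examples.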

Minimizing the training loss rewards fitting the training examples as closely as possible, and a model with enough capacity can do that by memorizing them. Its performance on the training set is therefore an optimistic figure that tells us very little about how it’ll behave on data it hasn’t seen. The training set is for learning, not for evaluation.
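One way to see the gap is to train a flexible model on labels that carry no signal at all. The sketch below assumes scikit-learn is available and uses an unrestricted decision tree on purely random labels. The data and the choice of model are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Random features and labels that are independent of the features (illustrative data):
# there is nothing real to learn here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every training example is classified correctly.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The tree memorizes the training labels, so its training accuracy is perfect, while its accuracy on the held-out rows should sit near chance.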

For evaluation, we need to set aside a separate test set before training begins, never let the model see it during training, and only use it at the end to estimate generalization performance. A common split is 70-80% for training and 20-30% for testing.

Often we go further and subdivide the training set into:

  • A training portion used to fit the model parameters by gradient descent.
  • A validation portion used to compare different model configurations and pick hyperparameters (learning rate, polynomial degree, regularization strength).

So the full arrangement is train / validate / test: train fits the parameters, validate tunes the hyperparameters, test estimates generalization. We don’t tune anything after seeing the test set — that would contaminate the evaluation.
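The train / validate / test arrangement can be sketched with two calls to train_test_split. This is only one possible setup: the 60/20/20 proportions, the Ridge model, and the candidate alpha values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# First split: hold out 20% as the test set and do not touch it until the end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: carve a validation portion out of the remaining data (25% of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0
)

# Train fits the parameters; validate picks the hyperparameter.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)   # R^2 on the validation portion
best_alpha = max(val_scores, key=val_scores.get)

# Test estimates generalization, once, after all choices are fixed.
final_model = Ridge(alpha=best_alpha).fit(X_dev, y_dev)
test_r2 = final_model.score(X_test, y_test)
print(f"best alpha: {best_alpha}, test R^2: {test_r2:.3f}")
```

Refitting the final model on the training and validation portions together is a design choice, not a requirement. The point that matters is that the test rows play no part in choosing alpha.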

In scikit-learn, train_test_split separates the data:

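from sklearn.model_selection import train_test_split

# X holds the features and y the targets; both are assumed to be loaded already.
# Hold out 30% of the rows as the test set; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)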

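In cross-validation, the train/validate split is repeated several times, with each fold taking a turn as the validation portion while the remaining folds are used for fitting. Averaging the scores across folds gives a more robust estimate when the dataset is small.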
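A minimal sketch of k-fold cross-validation with scikit-learn’s KFold and cross_val_score follows. The small synthetic dataset, the Ridge model, and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# A deliberately small synthetic dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=60)

# Five folds: each fold serves as the validation portion exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv)  # one R^2 score per fold

print(f"fold scores: {np.round(scores, 3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread of the per-fold scores gives a rough sense of how much a single train/validate split could mislead on a dataset this size. A separate test set is still needed for the final estimate.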

The discipline around training and test data is what keeps machine-learning evaluation honest. The most common ways to break it are fitting preprocessing steps on the entire dataset before splitting, peeking at the test set during development, and repeatedly tuning against the test set. All of these amount to data leakage and produce models that look good in evaluation and fail in deployment.
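For the preprocessing case, one common safeguard is to split first and wrap the preprocessing and the model in a scikit-learn Pipeline, so that scaling statistics are computed from the training rows only. The sketch below assumes synthetic data, a StandardScaler, and a LogisticRegression classifier, all chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data with unscaled features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y = (X[:, 0] + X[:, 1] > 10.0).astype(int)

# Split BEFORE any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on X_train only, then reuses those statistics on X_test.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)

# The scaler's learned mean matches the training rows, not the full dataset.
scaler = model.named_steps["standardscaler"]
print("scaler mean matches training mean:", np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(f"test accuracy: {test_acc:.2f}")
```

The leaky variant would call StandardScaler().fit(X) on all rows before splitting, which lets test-set statistics influence the features the model trains on.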