Train-test split

A train-test split divides the dataset into two disjoint portions: a training set for fitting the model and a test set for the final evaluation. A typical ratio is 70-80% training and 20-30% test, but the exact split is an empirical choice. With a small dataset, every held-out example is one fewer for training, so the test set is often kept small at the cost of a noisier estimate. With a large dataset, the test set can contain more examples, which gives a more precise estimate of generalization.
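To make the link between test-set size and precision concrete, one rough approach is to treat accuracy measured on n independent test examples as a binomial proportion, whose standard error is about sqrt(p(1-p)/n). The sketch below only illustrates that approximation; the helper name accuracy_std_error and the 90% accuracy figure are assumptions made up for the example.

```python
import math


def accuracy_std_error(accuracy: float, n_test: int) -> float:
    """Approximate standard error of an accuracy estimate (binomial assumption)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Hypothetical model with 90% true accuracy, scored on test sets of different sizes.
for n_test in (100, 1_000, 10_000):
    se = accuracy_std_error(0.9, n_test)
    print(f"n_test={n_test:>6}: accuracy = 0.900 +/- {se:.4f}")
```

Under this approximation, the uncertainty shrinks with the square root of the test-set size, so 100 times more test examples gives an estimate about 10 times tighter.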

In scikit-learn:

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,        # 30% to test, 70% to train
    shuffle=True,         # randomize order before splitting
    random_state=0        # fix the random seed for reproducibility
)

The arguments:

test_size=0.3 allocates 30% to test. The complement (70%) goes to training. You can pass an integer instead of a fraction to specify an exact number of test examples.

shuffle=True (the default) randomizes the order before splitting. This matters when the original dataset is ordered, for example with all the high-quality wines clustered at the end. Without shuffling, the test set might then contain only one class, which makes it useless for evaluation.

random_state=0 fixes the random seed, so the same split is produced every time the code runs. Without it, every run gives a different split and slightly different results, which makes debugging and comparing models much harder.

stratify=y (not shown above) keeps the class proportions in train and test the same as in the full dataset, up to rounding. This matters when classes are imbalanced: without stratification, a random split of a dataset with a 95/5 class balance could leave the rare class badly under-represented, or in a small dataset entirely absent, in either the training set or the test set.
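As a small illustration of stratification, the sketch below builds a synthetic dataset with a 95/5 class balance (the data is made up for the example) and checks the rare-class rate on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, made up for illustration: 950 negatives, 50 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the 95/5 class ratio in both parts
    random_state=0,
)

print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```

Both rates should come out close to 0.05. Dropping stratify=y and rerunning with different seeds shows how much the test-set rate can drift when the split is left purely to chance.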

Beyond a simple split

For hyperparameter tuning, we usually want a three-way split with a separate validation set. A simple way to get one is to call train_test_split twice:

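# First split: hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
# Now: 60% train, 20% validation, 20% test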

For small datasets, K-fold cross-validation is more sample-efficient: split the data into K folds, train on K-1 of them and validate on the remaining one, rotate through all K folds, and average the results.
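A minimal sketch of 5-fold cross-validation with scikit-learn's KFold and cross_val_score follows. The synthetic dataset and the choice of LogisticRegression are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset, made up for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Five folds: each example is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", scores.mean().round(3))
```

In practice the test set would still be held out, with cross-validation run on the training portion only.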

Avoiding data leakage

The point of the split is to keep test data unseen during training. All preprocessing (normalization, imputation, feature extraction) must be fit on the training set only and then applied to the test set using the training-fitted parameters. Fitting preprocessing on the entire dataset before the split lets information from the test set influence training, so the test score is no longer an honest estimate of generalization. This is data leakage.
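Done by hand, the fit-on-train, apply-to-test rule looks roughly like the sketch below. It uses StandardScaler as the preprocessing step and synthetic data made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from X_train only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

# Leaky version, for contrast: StandardScaler().fit(X) would let test rows
# influence the mean and std that are later applied to the training data.
```

The manual version works, but it relies on remembering to call transform rather than fit_transform on the test data every time.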

The cleanest way to enforce this is a scikit-learn pipeline:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)        # scaler fits on X_train, then transforms it
y_pred = clf.predict(X_test)     # scaler transforms X_test with stored params

The pipeline handles fit-on-training and transform-on-test automatically. As long as fit is only ever called on training data, test data cannot leak into the scaler's fit.

Idriss Rami — Notes
