A train-test split divides the dataset into two disjoint portions: a Training set for fitting the model and a Test set for the final evaluation. A typical ratio is 70-80% training and 20-30% test. The exact ratio is an empirical choice: with a small dataset, the test set is kept small to leave more data for training; with a large dataset, you can afford a bigger test set, which gives a more precise estimate of generalization performance.
In scikit-learn:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # 30% to test, 70% to train
    shuffle=True,     # randomize order before splitting
    random_state=0    # fix the random seed for reproducibility
)
A few things in this call are worth knowing:
test_size=0.3 allocates 30% to test. The complement (70%) goes to training. You can pass an integer instead of a fraction to specify an exact number of test examples.
shuffle=True randomizes the order before splitting. This matters when the original dataset has ordered structure — say, all the high-quality wines clustered at the end. Without shuffling, the test set might contain only one class, which is useless for evaluation.
random_state=0 fixes the random seed. The same split happens every time the code is run. This is essential for reproducibility — without it, every run of the script gives a different split and slightly different results, which makes debugging and comparison nearly impossible.
stratify=y (not shown above) ensures the class proportions in train and test match those in the full dataset. This is critical when classes are imbalanced: with a 95/5 class ratio and a small dataset, an unstratified random split can leave the rare class badly under-represented in the test set, or place it entirely in one of the two portions.
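The effect of stratify=y is easy to check on synthetic data. The following is a minimal sketch under stated assumptions: the 95/5 label array and the variable names are made up for illustration and have nothing to do with the wine data mentioned above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 950 examples of class 0, 50 of class 1.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 950 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,       # keep the 95/5 class ratio in both portions
    random_state=0,
)

# With 0/1 labels, the mean is the fraction of the rare class.
train_rare_fraction = y_train.mean()
test_rare_fraction = y_test.mean()
print(len(y_train), len(y_test))
print(train_rare_fraction, test_rare_fraction)
```

Both fractions should come out at roughly 0.05. Dropping stratify=y and rerunning with different seeds would let the test-set fraction drift away from that value.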
Beyond a simple split
For Hyperparameter tuning, we usually want a three-way split with a separate Validation set. The clean way is to call train_test_split twice:
# First split: hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Second split: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
# Now: 60% train, 20% validation, 20% test
For small datasets, K-fold cross-validation is more sample-efficient: train and validate multiple times on different splits and average the results.
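The K-fold idea can be sketched with scikit-learn's KFold and cross_val_score. The toy data, the choice of 5 folds, and the LogisticRegression model are assumptions picked only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy data: 100 examples, 2 features, label depends on the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# 5 folds: each example is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate across the folds
```

A held-out test set is still useful on top of this: cross-validation stands in for the validation split, not for the final evaluation.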
Avoiding data leakage
The point of the split is to keep test data unseen during training. All preprocessing (Normalization, Imputation, Feature extraction) must be fit on the training set only and then applied to the test set using the training-fitted parameters. Fitting preprocessing on the entire dataset before the split lets information from the test set, such as its contribution to the feature means and variances, influence what the model sees during training, so the test score no longer measures performance on truly unseen data. This is Data leakage.
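Done by hand, the fit-on-train rule looks roughly like the sketch below. The random data and the variable names are hypothetical; StandardScaler stands in for any preprocessing step that learns parameters from data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 examples, 3 features, random binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Correct: the scaling statistics come from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

# Leaky (shown for contrast only): statistics computed over all rows,
# so the test rows influence the transformation.
leaky_scaler = StandardScaler().fit(X)

print(scaler.mean_)        # training-set means
print(leaky_scaler.mean_)  # full-dataset means, shifted by the test rows
```

The two sets of means differ only slightly here, but that difference is exactly the information that should never reach the training step.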
The cleanest way to enforce this is a scikit-learn pipeline:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)      # scaler fits on X_train, then transforms it
y_pred = clf.predict(X_test)   # scaler transforms X_test with stored params
The pipeline handles fit-on-training and transform-on-test automatically. As long as fit is only ever called with training data, the test data cannot leak into the scaler's fit.
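One way to convince yourself of this is to inspect the scaler inside a fitted pipeline. The sketch below uses made-up data; the step name "standardscaler" is the lowercase class name that make_pipeline assigns automatically.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 examples, 3 features, label depends on the first.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# The scaler stored in the pipeline was fitted on X_train only.
fitted_scaler = clf.named_steps["standardscaler"]
matches_train = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
test_accuracy = clf.score(X_test, y_test)

print(matches_train)
print(test_accuracy)
```

The same pipeline object can be passed to cross-validation helpers, in which case the scaler is refit on the training part of every fold.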