Data leakage is a failure mode where information from the test set influences training, contaminating the evaluation. When it happens, the test score no longer measures generalization to unseen data; it measures performance on data that has already, directly or indirectly, shaped the model. The model appears to perform well during evaluation but fails in deployment.

Two scenarios cause leakage particularly often:

Preprocessing on the whole dataset before splitting

Suppose we have the wine-quality dataset and we follow the Chapter 4 recipe: normalize the features, impute missing values, possibly extract features through a rolling window. If we do these operations on the entire dataset and then split into train and test, we’ve leaked. The mean and standard deviation used for normalization were computed using the test data. The imputed values were computed in part from the test data. The test set has influenced the preprocessing, and the model implicitly knows about it.

The fix is simple: split first, fit preprocessing on the training set only, then apply the same transformation to the test set. Compute the normalization statistics from the training set alone, store them in the scaler, and transform the training set using them. When the test set arrives, transform it using the same training-fitted statistics; don’t recompute them. The scaler is fit on the training set only and applied to both sets. The test set is preprocessed but doesn’t contribute to the preprocessing rules.
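Written out by hand, the split-fit-apply order looks like the following minimal sketch. The data is synthetic and the 80/20 cut is an arbitrary choice made for illustration, not something taken from the wine-quality example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix (hypothetical: 100 rows, 3 features).
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# 1. Split first: the last 20 rows are held out as the test set.
X_train, X_test = X[:80], X[80:]

# 2. Fit: compute the statistics from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# 3. Apply: transform both sets with the same training-fitted statistics.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
```

The scaled training set has mean 0 and standard deviation 1 by construction. The scaled test set generally does not, and that is expected: it was measured against the training set's statistics, not its own.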

This is exactly what scikit-learn’s fit_transform and transform distinction is built for:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fits AND transforms training
X_test  = sc.transform(X_test)         # transforms test using training's stats

Never call fit_transform on the test set, and never call it on the entire dataset before splitting. The clean way to enforce this is a Pipeline:

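from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)             # internal scaler fits on X_train
y_pred = clf.predict(X_test)          # internal scaler transforms X_test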
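The same idea carries over to imputation and to cross-validation. The sketch below uses synthetic data with artificially removed entries (an assumption for illustration) and passes the whole pipeline to cross_val_score, which refits the imputer and the scaler on the training portion of every fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels computed from the complete data
X[rng.random(X.shape) < 0.05] = np.nan    # then knock out roughly 5% of the entries

# Imputer and scaler live inside the pipeline, so each fold fits them
# on its own training rows and only applies them to its held-out rows.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(pipe, X, y, cv=5)  # one accuracy value per fold
```

Imputing or scaling X once up front and then cross-validating only the classifier would reintroduce the leak, because every fold's held-out rows would have contributed to the column means.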

The same example appearing in both sets

This form of leakage is subtle and easy to introduce by accident:

  • Duplicates in the dataset. If the same row appears twice and one copy ends up in train and the other in test, the test number is inflated.
  • Strong temporal structure. If we have time-series data and split randomly, consecutive samples — which are practically the same recording — can end up in both train and test. The model didn’t learn to predict the future; it learned to predict near-duplicates of its training data.
  • Group structure. If we have multiple samples from the same subject (multiple ECG recordings from the same patient), randomly splitting individual samples puts samples from one subject in both train and test. The model has learned subject-specific patterns and looks good on test data even though it would fail on truly new subjects.

The fix depends on the structure: deduplicate before splitting; split temporally (train on earlier data, test on later data) for time series; and use GroupKFold, or another split keyed on subject ID, for grouped data.
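The three fixes can be sketched on a toy array as follows. The column layout, the 75% time cutoff, and all variable names are hypothetical choices for illustration; only GroupKFold and the NumPy calls are real library APIs.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: columns are [subject_id, time, feature, label],
# with 3 recordings for each of 4 subjects.
subjects = np.repeat([1, 2, 3, 4], 3)
rows = np.array([[s, t, 0.5 * t, t % 2] for t, s in enumerate(subjects)], dtype=float)
rows = np.vstack([rows, rows[0]])          # append an exact duplicate of the first row

# Duplicates: keep the first occurrence of each distinct row, in original order.
_, first_idx = np.unique(rows, axis=0, return_index=True)
rows = rows[np.sort(first_idx)]

# Temporal structure: sort by time, train on the earliest 75%, test on the rest.
rows = rows[np.argsort(rows[:, 1])]
cut = int(0.75 * len(rows))
train_time, test_time = rows[:cut], rows[cut:]

# Group structure: GroupKFold keeps each subject entirely on one side of a fold.
groups = rows[:, 0]
X, y = rows[:, [2]], rows[:, 3]
folds = list(GroupKFold(n_splits=4).split(X, y, groups=groups))
```

Each element of folds is a (train_indices, test_indices) pair in which no subject ID appears on both sides. In real data the temporal and grouped splits are alternatives chosen according to the structure, not steps to apply in sequence.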

Why it matters

Without leakage, the test score is an honest estimate of how the model will perform on new data. With leakage, the test score is inflated, sometimes dramatically. Models that look great in evaluation and fail in production often have leakage somewhere in the development pipeline; it is one of the most common reasons a deployed model performs worse than expected.
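A small experiment makes the inflation visible. In this sketch the labels are pure noise, so no model can honestly beat chance. The data, the 1-nearest-neighbour classifier, and the helper name holdout_accuracy are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)      # labels are random: there is nothing to learn

def holdout_accuracy(X, y):
    """Random 50/50 split, 1-nearest-neighbour classifier, accuracy on the held-out half."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Honest estimate: every row is unique.
clean_acc = holdout_accuracy(X, y)

# Leaky estimate: every row appears twice, so many test rows have an
# identical twin (with the same label) sitting in the training set.
leaky_acc = holdout_accuracy(np.vstack([X, X]), np.concatenate([y, y]))
```

The clean estimate should sit near chance (about 0.5), while the duplicated dataset should score well above it, even though the labels carry no signal at all. The gap comes entirely from test rows whose copies were memorized during training.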

The discipline around leakage is part of why machine-learning evaluation has so much structure: distinct training, validation, and test sets; preprocessing fit only on training; pipelines that enforce the discipline automatically. Each piece is there to prevent some category of leakage.