Data leakage

Data leakage is a failure mode in which information from the test set influences training and contaminates the evaluation. The test set stops measuring generalization to unseen data and starts measuring performance on data the model has, in effect, already seen. The model looks good during evaluation but fails in deployment.

Two common causes:

Preprocessing on the whole dataset before splitting

Take the wine-quality dataset and the usual recipe: normalize the features, impute missing values, maybe extract features through a rolling window. Do these on the entire dataset and then split into train and test, and test-set information has leaked into training. The mean and standard deviation used for normalization were computed using the test data. The imputed values were computed in part from the test data. The test set has influenced the preprocessing, so every input the model trains on already carries information about it.

The fix: split first, fit preprocessing on the training set only, then apply the same transformation to the test set. Compute the normalization statistics from the training set alone, store them in the scaler, and transform the training set with them. When the test set arrives, transform it using the same training-fitted statistics; don’t recompute them. The scaler is fit on the training set and applied to both sets. The test set is preprocessed but doesn’t contribute to the preprocessing rules.

This is exactly what the distinction between scikit-learn’s fit_transform and transform is built for:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # learns mean/std from training data, then scales it
X_test  = sc.transform(X_test)        # scales test data with the training mean/std

Never call fit_transform on the test set, and never call it on the entire dataset before splitting. The clean way to enforce this is a Pipeline:

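from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)             # internal scaler fits on X_train only
y_pred = clf.predict(X_test)          # internal scaler transforms X_test, no refit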
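
To make the difference concrete, here is a minimal sketch that compares the statistics a scaler learns from the whole dataset with those the pipeline learns from the training split. The synthetic data, the variable names, and the 75/25 split are assumptions chosen for illustration, not part of the recipe above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; any numeric feature matrix works).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Leaky: the statistics are computed from every row, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Clean: the pipeline fits its scaler inside fit(), on X_train only.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
pipeline_scaler = clf.named_steps["standardscaler"]

print("train-only mean :", X_train.mean(axis=0))
print("pipeline mean_  :", pipeline_scaler.mean_)
print("whole-data mean_:", leaky_scaler.mean_)
```

The pipeline's stored mean should match the training-split mean, while the whole-dataset mean differs because the test rows contributed to it.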

The same example appearing in both sets

This is subtle and easy to do accidentally:

Duplicates in the dataset. If the same row appears twice and one copy ends up in train and the other in test, the test score is inflated.
Strong temporal structure. Time-series data split randomly puts consecutive samples (practically the same recording) in both train and test. The model didn’t learn to predict the future; it learned to predict near-duplicates of its training data.
Group structure. With multiple samples from the same subject (for example, multiple ECG recordings from the same patient), randomly splitting individual samples puts samples from one subject in both train and test. The model learns subject-specific patterns and looks good on test data even though it would fail on truly new subjects.

The fix depends on the structure: deduplicate before splitting; for time series, split temporally (train on early data, test on later data); for grouped data, use GroupKFold or split by subject ID.
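
The three fixes could look roughly like this. The toy table and its column names (patient_id, timestamp, feature, label) are hypothetical, and taking only the first GroupKFold fold as a single train/test split is a simplification for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy table; the column names are illustrative assumptions.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "feature": [0.1, 0.2, 0.4, 0.5, 0.9, 0.8, 0.3, 0.35, 0.7, 0.6],
    "label": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
})
# Append an exact copy of the first row to simulate a duplicate.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# 1) Deduplicate before any split.
deduped = df.drop_duplicates().reset_index(drop=True)

# 2) Temporal split: train on the earliest 80% of rows, test on the rest.
ordered = deduped.sort_values("timestamp").reset_index(drop=True)
cutoff = int(len(ordered) * 0.8)
train_time, test_time = ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# 3) Grouped split: all rows of a patient land on the same side.
gkf = GroupKFold(n_splits=5)
train_idx, test_idx = next(gkf.split(deduped, groups=deduped["patient_id"]))
train_grp, test_grp = deduped.iloc[train_idx], deduped.iloc[test_idx]

shared = set(train_grp["patient_id"]) & set(test_grp["patient_id"])
print("patients in both train and test:", shared)
```

In practice the temporal and grouped splits are alternatives chosen by the structure of the data; they are shown side by side here only to keep the sketch short.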

Why it matters

Without leakage, the test score is an honest estimate of how the model will perform on new data. With leakage, the test score is inflated, sometimes dramatically. Models that look great in evaluation and fail in production often have leakage somewhere in the development pipeline; it is one of the most common reasons a deployed model performs worse than expected.
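
A small experiment can illustrate the inflation. In this sketch the labels are random and independent of the features, so an honest test accuracy should sit near chance. Duplicating every row before a random split lets a 1-nearest-neighbor classifier look up the twin of many test rows in the training set. The data, the choice of classifier, and the helper name holdout_accuracy are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # labels carry no signal about X


def holdout_accuracy(features, labels):
    """Random 50/50 split, fit 1-NN on train, return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.5, random_state=0
    )
    model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


# Clean: every row is unique, so test rows are genuinely unseen.
clean_acc = holdout_accuracy(X, y)

# Leaky: every row appears twice, so many test rows have a twin in train.
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])
leaky_acc = holdout_accuracy(X_dup, y_dup)

print(f"clean accuracy: {clean_acc:.2f}")
print(f"leaky accuracy: {leaky_acc:.2f}")
```

The leaky score comes out well above the clean one even though there is nothing to learn, which is the sense in which the test score stops being an honest estimate.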

Leakage is part of why machine-learning evaluation has so much structure: distinct training, validation, and test sets; preprocessing fit only on training; pipelines that enforce the discipline automatically. Each piece prevents some category of leakage.

Idriss Rami — Notes
