scikit-learn pipeline

A scikit-learn pipeline chains preprocessing and modelling steps into a single object that applies them in order. Every step except the last must be a transformer (with .fit() and .transform()); the last step is usually a model (with .fit() and .predict()). From the outside the pipeline behaves like a single model: .fit() and .predict() work the same as on a bare estimator. Internally it walks through each step in sequence.

The standard idiom is make_pipeline(...):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
 
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

When we call clf.fit(X_train, y_train):

1. The StandardScaler fits on X_train (computing per-column means and standard deviations).
2. The scaler transforms X_train into a normalized array.
3. The LogisticRegression fits on the normalized training data.

When we call clf.predict(X_test):

1. The (already-fit) StandardScaler transforms X_test using the training-set statistics.
2. The (already-fit) LogisticRegression predicts on the normalized test data.
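The fit and predict steps above can be reproduced by hand to check that the pipeline does nothing more than this. The sketch below is only an illustration: the synthetic dataset from make_classification and the variable names are assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (an assumption, not from the notes).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
pipeline_pred = clf.predict(X_test)

# The same steps done by hand.
scaler = StandardScaler().fit(X_train)        # fit step 1: learn means and stds
X_train_scaled = scaler.transform(X_train)    # fit step 2: normalize training data
model = LogisticRegression(max_iter=10000).fit(X_train_scaled, y_train)  # fit step 3

X_test_scaled = scaler.transform(X_test)      # predict step 1: reuse training statistics
manual_pred = model.predict(X_test_scaled)    # predict step 2

same_predictions = np.array_equal(manual_pred, pipeline_pred)
print(same_predictions)
```

Both routes see the same training data in the same order, so the predictions should agree.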

Why this matters: data leakage prevention

The pipeline pattern is the canonical way to avoid data leakage in scikit-learn. The scaler never sees the test data during its fit(): it is fit on training data only, and the stored parameters are reused for test data. As long as every data-dependent preprocessing step lives inside the pipeline, test-set information cannot slip into the preprocessing.

The alternative — fitting the scaler manually, transforming both training and test, then fitting the classifier — is also fine if done carefully, but the pipeline pattern enforces the discipline automatically. One less thing to get wrong.
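To make the "done carefully" part concrete, here is a minimal sketch of the manual route next to the typical mistake of fitting the scaler on all rows. The random data and the names good_scaler and leaky_scaler are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data, purely for illustration (an assumption, not from the notes).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Careful manual version: statistics come from the training rows only.
good_scaler = StandardScaler().fit(X_train)
X_train_scaled = good_scaler.transform(X_train)
X_test_scaled = good_scaler.transform(X_test)

# Leaky version: fitting on all rows lets the test rows shape the means and stds.
leaky_scaler = StandardScaler().fit(X)

train_only = np.allclose(good_scaler.mean_, X_train.mean(axis=0))
leaked = not np.allclose(leaky_scaler.mean_, X_train.mean(axis=0))
print(train_only, leaked)
```

The leaky scaler's means are computed over the test rows as well as the training rows, which is exactly the information a pipeline keeps out of fit().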

Beyond two steps

A pipeline can have any number of steps. The last step is a model; all earlier steps are transformers:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

clf = make_pipeline(
    SimpleImputer(strategy='median'),       # fill missing values
    StandardScaler(),                       # normalize
    PolynomialFeatures(degree=2),           # add quadratic features
    LogisticRegression()                    # classifier
)

make_pipeline auto-names the steps from the class names (lowercase). For custom names, use Pipeline([('imputer', ...), ('scaler', ...), ...]) instead.
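A small sketch of the two naming styles side by side. The custom names "scaler" and "clf" are arbitrary choices for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name.
auto = make_pipeline(StandardScaler(), LogisticRegression())
auto_names = [name for name, _ in auto.steps]

# Pipeline takes explicit (name, estimator) pairs.
# "scaler" and "clf" are hypothetical names chosen for this example.
named = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
custom_names = [name for name, _ in named.steps]

print(auto_names)
print(custom_names)
```

Whichever names are used here are the ones that appear later in named_steps and in double-underscore parameter keys.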

Inspecting fitted steps

Once the pipeline has been fit, the fitted scaler, polynomial features, etc., are accessible by step name through named_steps:

clf.named_steps['standardscaler'].mean_     # column means from training
clf.named_steps['logisticregression'].coef_ # learned weights
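Besides named_steps, recent scikit-learn versions also let a pipeline be indexed and sliced like a list. The sketch below assumes a small synthetic dataset and a three-step pipeline, and uses slicing to look at the features the final model actually receives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with 4 columns, purely for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LogisticRegression(max_iter=10000),
)
clf.fit(X, y)

scaler = clf[0]         # same object as clf.named_steps['standardscaler']
model = clf[-1]         # the fitted LogisticRegression
preprocess = clf[:-1]   # sub-pipeline holding every step except the model

# Features as the classifier sees them: scaled, then expanded to degree 2.
X_features = preprocess.transform(X)
print(X_features.shape, model.coef_.shape)
```

Comparing X_features.shape with model.coef_.shape is a quick sanity check that the weights line up with the expanded feature columns.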

With cross-validation

Pipelines compose cleanly with cross-validation and hyperparameter search:

from sklearn.model_selection import cross_val_score, GridSearchCV
 
# Cross-validation
scores = cross_val_score(clf, X, y, cv=5)
 
# Hyperparameter grid search
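search = GridSearchCV(
    clf,
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
    cv=5
)
search.fit(X, y)               # runs 5-fold CV for every value of C
print(search.best_params_)     # best C found by the search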

The double-underscore in logisticregression__C lets us target a step’s hyperparameter through the pipeline. In each cross-validation fold, the pipeline refits the scaler on that fold’s training data — exactly what we want, with no leakage between folds.
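Both claims can be checked directly. The sketch below assumes a synthetic dataset. It sets a nested parameter with set_params, then uses cross_validate with return_estimator=True to keep each fold's fitted pipeline and read back the scaler means.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (an assumption, not from the notes).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Key format: step name + "__" + parameter name of that step.
clf.set_params(logisticregression__C=0.5)
c_value = clf.get_params()["logisticregression__C"]

# Keep the pipeline fitted in each fold so its scaler can be inspected.
cv = cross_validate(clf, X, y, cv=5, return_estimator=True)
fold_means = [est.named_steps["standardscaler"].mean_ for est in cv["estimator"]]

# Each fold trains on different rows, so the scaler statistics differ.
means_differ = not np.allclose(fold_means[0], fold_means[1])
print(c_value, len(fold_means), means_differ)
```

Each fold's scaler is fit on that fold's training rows alone, so its means differ from fold to fold.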

Idriss Rami — Notes
