A scikit-learn pipeline chains preprocessing and modelling steps into a single object that handles them in order. Each step is a transformer (which has .fit() and .transform()) or a model (the final step, with .fit() and .predict()). The pipeline acts like a single model from the outside — .fit() and .predict() work the same as on a bare estimator — but internally it walks through each step in sequence.
The standard idiom is make_pipeline(...):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
When we call clf.fit(X_train, y_train):
- The StandardScaler fits on X_train (computing per-column means and standard deviations).
- The scaler transforms X_train into a normalized array.
- The LogisticRegression fits on the normalized training data.
When we call clf.predict(X_test):
- The (already-fit) StandardScaler transforms X_test using the training-set statistics.
- The (already-fit) LogisticRegression predicts on the normalized test data.
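The two walkthroughs above can be sketched by hand. This is a minimal illustration, not scikit-learn's internal implementation. It assumes a synthetic dataset from make_classification (not part of the original example) and compares the pipeline's predictions with those from chaining the same steps manually.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: one object, one fit, one predict.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred_pipeline = clf.predict(X_test)

# Manual version of the same steps.
scaler = StandardScaler().fit(X_train)         # learn means and std devs from training data
model = LogisticRegression(max_iter=10000)
model.fit(scaler.transform(X_train), y_train)  # fit the classifier on scaled training data
y_pred_manual = model.predict(scaler.transform(X_test))  # reuse the training statistics

# Check whether both routes give the same predictions.
print(np.array_equal(y_pred_pipeline, y_pred_manual))
```

The manual version needs two objects and three calls that must happen in the right order; the pipeline bundles them behind a single fit() and predict().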
Why this matters: data leakage prevention
The pipeline pattern is the canonical way to avoid data leakage in scikit-learn. The scaler never sees the test data during its fit(): it is fit on the training data only, and the stored parameters are reused for the test data. As long as the pipeline itself is fit only on training data, test-set information cannot slip into the preprocessing.
The alternative — fitting the scaler manually, transforming both training and test, then fitting the classifier — is also fine if done carefully, but the pipeline pattern enforces the discipline automatically. One less thing to get wrong.
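To make the leakage point concrete, here is a small sketch under assumed synthetic data. The dataset and the name leaky_scaler are illustrative, not from the original. It compares the statistics stored by the pipeline's scaler with those of a scaler that was, incorrectly, fit on the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)

# The pipeline's scaler stores statistics computed from the training rows only.
pipeline_means = clf.named_steps['standardscaler'].mean_
uses_train_only = np.allclose(pipeline_means, X_train.mean(axis=0))

# The mistake the pipeline guards against: fitting the scaler on all rows,
# so that test rows contribute to the stored means.
leaky_scaler = StandardScaler().fit(X)
leaky_differs = not np.allclose(leaky_scaler.mean_, pipeline_means)

print(uses_train_only, leaky_differs)
```

The difference in means is usually small on data like this, but it is exactly the kind of test-set information that should not reach the model.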
Beyond two steps
A pipeline can have any number of steps. The last step is a model; all earlier steps are transformers:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
clf = make_pipeline(
    SimpleImputer(strategy='median'),  # fill missing values
    StandardScaler(),                  # normalize
    PolynomialFeatures(degree=2),      # add quadratic and interaction features
    LogisticRegression(),              # classifier
)
make_pipeline names each step automatically after its class, lowercased (for example, standardscaler). For custom names, use Pipeline([('imputer', ...), ('scaler', ...), ...]) instead.
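As a rough side-by-side of the two naming styles, the sketch below builds the same four steps both ways and lists the resulting step names. The custom names 'poly' and 'clf' are hypothetical choices made for this example; any unique strings would do.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Automatic names: each step is named after its class, lowercased.
auto = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LogisticRegression(),
)
auto_names = [name for name, _ in auto.steps]

# Explicit names: a list of (name, estimator) pairs.
named = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),  # hypothetical name
    ('clf', LogisticRegression()),           # hypothetical name
])
custom_names = [name for name, _ in named.steps]

print(auto_names)
print(custom_names)
```

Shorter custom names mostly pay off later, when the step name becomes part of every hyperparameter key (clf__C rather than logisticregression__C).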
Inspecting fitted steps
The fitted scaler, polynomial features, etc., are accessible from the pipeline:
clf.named_steps['standardscaler'].mean_      # column means from training
clf.named_steps['logisticregression'].coef_  # learned weights
With cross-validation
Pipelines compose cleanly with cross-validation and hyperparameter search:
from sklearn.model_selection import cross_val_score, GridSearchCV
# Cross-validation
scores = cross_val_score(clf, X, y, cv=5)
# Hyperparameter grid search
search = GridSearchCV(
    clf,
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
The double underscore in logisticregression__C lets us target a step’s hyperparameter through the pipeline: the part before it is the step name, and the part after it is the parameter name. In each cross-validation fold, the whole pipeline is refit, scaler included, on that fold’s training data. That is exactly what we want, with no leakage between folds.
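For completeness, here is a hedged end-to-end sketch of the search being run. It assumes a synthetic dataset from make_classification (the original does not say where X and y come from) and reads the winning value of C back from the fitted search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Plain cross-validation: one score per fold.
scores = cross_val_score(clf, X, y, cv=5)

# Grid search over the classifier's C, addressed as <step name>__<parameter>.
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(clf, param_grid=param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_['logisticregression__C']
# With the default refit=True, best_estimator_ is a full pipeline refit on all of X.
best_scaler = search.best_estimator_.named_steps['standardscaler']

print(scores.mean(), best_C, search.best_score_)
```

If a key in param_grid is misspelled, the search raises an error; clf.get_params().keys() lists the valid names.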