A scikit-learn pipeline chains preprocessing and modelling steps into a single object that handles them in order. Each step is a transformer (which has .fit() and .transform()) or a model (the final step, with .fit() and .predict()). From the outside the pipeline acts like a single model: .fit() and .predict() work the same as on a bare estimator, but internally it walks through each step in sequence.
The standard idiom is make_pipeline(...):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
When we call clf.fit(X_train, y_train):
- The StandardScaler fits on X_train (computing per-column means and standard deviations).
- The scaler transforms X_train into a normalized array.
- The LogisticRegression fits on the normalized training data.
When we call clf.predict(X_test):
- The (already-fit) StandardScaler transforms X_test using the training-set statistics.
- The (already-fit) LogisticRegression predicts on the normalized test data.
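The two walkthroughs above can be checked directly. The following is a minimal sketch, assuming a synthetic dataset from make_classification (the data, the split, and the random seeds are illustrative assumptions, not part of the original example). It checks that the scaler's stored statistics come from X_train and that predict() matches applying the fitted steps by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)

# Pull the fitted steps back out of the pipeline
scaler = clf.named_steps['standardscaler']
model = clf.named_steps['logisticregression']

# The scaler's stored means were computed from the training data only
stats_from_train = np.allclose(scaler.mean_, X_train.mean(axis=0))

# predict() is equivalent to transforming, then predicting, step by step
manual_pred = model.predict(scaler.transform(X_test))
same_predictions = np.array_equal(clf.predict(X_test), manual_pred)

print(stats_from_train, same_predictions)
```

Both checks should hold for any dataset, since they only restate what fit() and predict() do internally.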
Why this matters: data leakage prevention
The pipeline pattern is the canonical way to avoid data leakage in scikit-learn. The scaler never sees the test data during its fit(). It’s fit on training data only, and the stored parameters are reused for the test data, so test-set information has no way to slip into the preprocessing.
The alternative (fitting the scaler manually on the training data, transforming both the training and test sets with it, then fitting the classifier) is also fine if done carefully. But the pipeline enforces the discipline automatically: one less thing to get wrong.
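For comparison, here is a rough sketch of the manual route next to the pipeline, again on synthetic data that is assumed only for illustration. It also includes the typical mistake (fitting the scaler on all of X before splitting) to show where the leaked information would come from. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual version, done carefully: the scaler is fit on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=10000)
model.fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline version: the same steps, with the ordering enforced for us
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
agree = np.array_equal(manual_pred, clf.predict(X_test))

# The easy mistake: a scaler fit on all of X also absorbs test-set statistics
leaky_scaler = StandardScaler().fit(X)
leaky = not np.allclose(leaky_scaler.mean_, scaler.mean_)

print(agree, leaky)
```

The careful manual version and the pipeline run the same computation. The difference is that the manual version relies on remembering to call fit() on X_train alone.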
Beyond two steps
A pipeline can have any number of steps. The last step is a model; all earlier steps are transformers:
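# Additional imports needed for the extra steps
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
clf = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LogisticRegression()
)
make_pipeline auto-names the steps from the lowercased class names. For custom names, use Pipeline([('imputer', ...), ('scaler', ...), ...]) instead.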
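As a small sketch of the two naming styles, the snippet below builds the same kind of pipeline both ways and lists the resulting step names. The custom names 'imputer' and 'scaler' follow the text above; 'model' is a hypothetical name chosen here for the final step.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each name from the lowercased class name
auto = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    LogisticRegression(),
)
auto_names = list(auto.named_steps)
print(auto_names)  # ['simpleimputer', 'standardscaler', 'logisticregression']

# Pipeline takes explicit (name, estimator) pairs
named = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),  # 'model' is a hypothetical name
])
custom_names = list(named.named_steps)
print(custom_names)  # ['imputer', 'scaler', 'model']
```

Custom names are mostly a convenience: they shorten the keys used for named_steps lookups and for hyperparameters addressed through the pipeline.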
Inspecting fitted steps
The fitted scaler, polynomial features, etc., are accessible from the pipeline:
clf.named_steps['standardscaler'].mean_ # column means from training
clf.named_steps['logisticregression'].coef_ # learned weights
With cross-validation
Pipelines compose cleanly with cross-validation and hyperparameter search:
from sklearn.model_selection import cross_val_score, GridSearchCV
# Cross-validation
scores = cross_val_score(clf, X, y, cv=5)
# Hyperparameter grid search
search = GridSearchCV(
    clf,
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
    cv=5
)
The double-underscore in logisticregression__C lets us target a step’s hyperparameter through the pipeline. In each cross-validation fold, the pipeline refits the scaler on that fold’s training data only, which is exactly what we want: no leakage between folds.
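To see the whole flow end to end, here is a self-contained sketch that runs both the cross-validation and the grid search. The dataset is synthetic and assumed only for illustration, so the scores and the selected C carry no meaning beyond showing the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Each of the 5 folds clones the pipeline and fits it on that fold's training part
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# '<step name>__<parameter>' addresses a hyperparameter inside the pipeline
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(clf, param_grid=param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)

# With the default refit=True, best_estimator_ is a full pipeline refit on all of X, y
best_scaler = search.best_estimator_.named_steps['standardscaler']
print(best_scaler.mean_)
```

Because the search is given the pipeline rather than the bare classifier, the scaling step is repeated inside every fold for every candidate value of C.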