A scikit-learn pipeline chains preprocessing and modelling steps into a single object that applies them in order. Every step except the last is a transformer (with .fit() and .transform()); the last step is typically a model (with .fit() and .predict()). From the outside the pipeline acts like a single model: .fit() and .predict() work the same as on a bare estimator, but internally the pipeline walks through each step in sequence.

The standard idiom is make_pipeline(...):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
 
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

When we call clf.fit(X_train, y_train):

  1. The StandardScaler fits on X_train (computing per-column means and standard deviations).
  2. The scaler transforms X_train into a standardized array (zero mean and unit variance per column).
  3. The LogisticRegression fits on the standardized training data.

When we call clf.predict(X_test):

  1. The (already-fit) StandardScaler transforms X_test using the training-set statistics.
  2. The (already-fit) LogisticRegression predicts on the standardized test data.
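
The predict() steps above can be replayed by hand on a fitted pipeline. The following is a minimal sketch, assuming synthetic data from make_classification (not part of the original example) and relying on Pipeline's slicing and indexing support:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: an assumption for illustration; any numeric X, y would do.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)

# Step 1: every step except the last transforms X_test with training statistics.
X_test_scaled = clf[:-1].transform(X_test)
# Step 2: the final, already-fit model predicts on the transformed data.
y_manual = clf[-1].predict(X_test_scaled)

# The hand-rolled result should agree with the pipeline's own predict().
same_predictions = np.array_equal(y_manual, clf.predict(X_test))
print(same_predictions)
```

Here clf[:-1] is a sub-pipeline that shares the already-fit steps, so nothing is refit during the check.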

Why this matters: data leakage prevention

The pipeline pattern is the canonical way to avoid data leakage in scikit-learn. The scaler never sees the test data during its fit(). It’s fit on training data only, and the stored parameters are reused for the test data, so test-set information has no way to slip into the preprocessing.

The alternative (fitting the scaler manually on the training data, transforming both the training and test sets with it, then fitting the classifier) is also fine if done carefully. But the pipeline enforces the discipline automatically. One less thing to get wrong.
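
For comparison, here is a rough sketch of the manual route next to the pipeline, along with the leaky mistake the pipeline rules out. The synthetic data and the variable names are assumptions made for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: an assumption for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual route, done carefully: the scaler is fit on training data only.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=10000)
model.fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline route: the same discipline, enforced by construction.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipe.fit(X_train, y_train)
matches = np.array_equal(manual_pred, pipe.predict(X_test))

# The pipeline's scaler holds training-set statistics only.
pipe_mean = pipe.named_steps['standardscaler'].mean_
uses_train_stats = np.allclose(pipe_mean, X_train.mean(axis=0))

# Leaky mistake: fitting the scaler on all rows lets test data shift the means.
leaky_mean = StandardScaler().fit(X).mean_
leak_changes_stats = not np.allclose(leaky_mean, pipe_mean)

print(matches, uses_train_stats, leak_changes_stats)
```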

Beyond two steps

A pipeline can have any number of steps. The last step is a model; all earlier steps are transformers:

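from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

clf = make_pipeline(
    SimpleImputer(strategy='median'),   # fill missing values with column medians
    StandardScaler(),                   # standardize each column
    PolynomialFeatures(degree=2),       # add squares and pairwise products
    LogisticRegression()
)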

make_pipeline auto-names the steps from the class names (lowercase). For custom names, use Pipeline([('imputer', ...), ('scaler', ...), ...]) instead.
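
A short sketch of the two naming styles follows. The names 'imputer' and 'scaler' come from the text above; 'poly' and 'model' are hypothetical names chosen here for illustration:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# make_pipeline derives each step name from the lowercased class name.
auto = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    PolynomialFeatures(degree=2),
    LogisticRegression(),
)
auto_names = [name for name, _ in auto.steps]

# Pipeline takes explicit (name, estimator) pairs instead.
named = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),   # hypothetical name
    ('model', LogisticRegression()),          # hypothetical name
])
custom_names = [name for name, _ in named.steps]

print(auto_names)
print(custom_names)
```

Custom names mostly pay off later, when steps are looked up by name or referenced in a parameter grid.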

Inspecting fitted steps

Once the pipeline has been fit, the fitted scaler, polynomial features, etc., are accessible from it by step name:

clf.named_steps['standardscaler'].mean_     # column means from training
clf.named_steps['logisticregression'].coef_ # learned weights
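
As a self-contained sketch of the same lookup, assuming a two-step pipeline fit on synthetic data (an assumption for illustration), the steps can also be reached by indexing the pipeline directly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data: an assumption for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X, y)  # fitted attributes such as mean_ and coef_ exist only after fit()

scaler = clf.named_steps['standardscaler']

# Indexing by name or by position returns the same fitted object.
same_by_name = clf['standardscaler'] is scaler
same_by_index = clf[0] is scaler

mean_shape = scaler.mean_.shape    # one mean per input column
coef_shape = clf[-1].coef_.shape   # one row of weights for a binary problem

print(same_by_name, same_by_index, mean_shape, coef_shape)
```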

With cross-validation

Pipelines compose cleanly with cross-validation and hyperparameter search:

from sklearn.model_selection import cross_val_score, GridSearchCV
 
# Cross-validation
scores = cross_val_score(clf, X, y, cv=5)
 
# Hyperparameter grid search
search = GridSearchCV(
    clf,
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},
    cv=5
)

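The double underscore in logisticregression__C lets us target a step’s hyperparameter through the pipeline: the part before it is the step name, the part after it is the parameter. In each cross-validation fold, a fresh copy of the whole pipeline is fit on that fold’s training data, so the scaler is refit every time. That is exactly what we want, with no leakage between folds.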
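
To see the search run end to end, here is a minimal sketch on synthetic data (the dataset is an assumption for illustration). It also lists the pipeline's parameters with get_params(), which is one way to find the valid double-underscore names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: an assumption for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Every tunable name, including nested '<step>__<param>' entries.
param_names = sorted(clf.get_params().keys())
has_nested_c = 'logisticregression__C' in param_names

# One score per fold; the whole pipeline is refit inside each fold.
fold_scores = cross_val_score(clf, X, y, cv=5)

c_values = [0.01, 0.1, 1, 10, 100]
search = GridSearchCV(clf, param_grid={'logisticregression__C': c_values}, cv=5)
search.fit(X, y)

best_c = search.best_params_['logisticregression__C']
best_pipeline = search.best_estimator_  # refit on all of X, y with the best C

print(fold_scores.mean(), best_c)
```

Because the scaler lives inside the estimator being searched, each candidate value of C is scored with preprocessing that was fit on that fold's training portion only.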