Introduction to Scikit-Learn Pipeline Objects

A scikit-learn Pipeline chains multiple processing steps, such as scaling, feature selection, and classification, into a single object that behaves like a standard estimator. Because the whole chain is fit and applied as one unit, it helps prevent data leakage during model evaluation and simplifies deployment.


Building a Basic Pipeline

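A pipeline is constructed from a list of (name, estimator) tuples. Every step except the last must be a transformer, implementing both fit() and transform(); the final step only needs to implement fit() and is typically a predictor such as a classifier or regressor.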
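To make the roles of the steps concrete, the following sketch compares a two-step pipeline with the equivalent manual sequence of calls. It is an illustration rather than a description of scikit-learn's internals; the step names "scaler" and "clf" and the choice of LogisticRegression are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pipeline version: one fit() call drives every step in order.
pipe = Pipeline([
    ("scaler", StandardScaler()),                 # intermediate step: fit + transform
    ("clf", LogisticRegression(max_iter=1000)),   # final step: only fit is required
])
pipe.fit(X, y)

# Manual equivalent: fit and transform with the scaler, then fit the classifier.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# At prediction time the pipeline calls transform() (not fit) on the scaler.
same = np.array_equal(pipe.predict(X), clf.predict(scaler.transform(X)))
print("Pipeline and manual predictions match:", same)
```

The pipeline keeps the fitted scaler and classifier together, so the same transformation is applied automatically whenever predict() is called.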

Pipeline with Scaling and Classification

<pre><code class="language-python">from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1.0, probability=True)),
])

pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))</code></pre>

Using make_pipeline for Convenience

<pre><code class="language-python">from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline auto-names each step from the lowercased class name
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_train, y_train)

print(pipe.named_steps)
# {'minmaxscaler': ..., 'logisticregression': ...}</code></pre>

Why Pipelines Prevent Data Leakage

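When you pass a pipeline to cross_val_score, every transformer is fit only on the training portion of each split (for example, the scaler's fit_transform sees just those rows) and is then applied, without refitting, to the held-out fold. The validation fold therefore never influences any fitted transformer.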
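The per-fold behaviour can be sketched by writing the loop by hand. The example below is a simplified illustration, not the library's actual implementation: it assumes a StandardScaler plus LogisticRegression pipeline and an unshuffled StratifiedKFold splitter, and the names fold_pipe, manual_scores, and auto_scores are chosen for this example only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_pipe = clone(pipe)  # fresh, unfitted copy for this fold
    # The scaler and the classifier are fit on the training rows only.
    fold_pipe.fit(X[train_idx], y[train_idx])
    # The held-out rows are only transformed and scored, never used for fitting.
    manual_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

# Passing the same splitter makes cross_val_score use identical folds.
auto_scores = cross_val_score(pipe, X, y, cv=cv)
print("Manual loop matches cross_val_score:", np.allclose(manual_scores, auto_scores))
```

Scaling the full dataset before splitting would instead let statistics from the validation rows (here, their means and variances) enter the fitted scaler, which is the leakage the pipeline avoids.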

Cross-Validation with a Pipeline

<pre><code class="language-python">from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Pipeline CV Accuracy: {scores.mean():.4f}")
# The scaler is re-fit from scratch on each training fold, so nothing leaks
# from the validation fold.</code></pre>

Hyperparameter Tuning Through Pipelines

A pipeline exposes each step's parameters under the name stepname__param (step name, two underscores, parameter name), so GridSearchCV can tune any step through the pipeline itself.
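The naming scheme can be inspected directly with get_params() and set_params(). The snippet below is a small sketch that assumes a pipeline whose steps are named "scaler" and "svc"; other step names would produce correspondingly different parameter prefixes.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

# Every parameter of the "svc" step is exposed as "svc__<param>".
svc_params = sorted(name for name in pipe.get_params() if name.startswith("svc__"))
print(svc_params)

# set_params uses the same names to update a nested step in place.
pipe.set_params(svc__C=10, svc__kernel="linear")
print(pipe.named_steps["svc"].C, pipe.named_steps["svc"].kernel)
```

The keys of a GridSearchCV parameter grid follow exactly this naming, which is why a grid entry such as "svc__C" only works when the pipeline actually contains a step called "svc".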

GridSearchCV Over a Pipeline

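<pre><code class="language-python">from sklearn.model_selection import GridSearchCV

# Rebuild the scaler + SVC pipeline: `pipe` was reassigned by the
# make_pipeline example, which has no step named "svc".
# Pipeline, StandardScaler, and SVC were imported in the first example.
svc_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC()),
])

param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["rbf", "linear"],
}

grid = GridSearchCV(svc_pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best score:", grid.best_score_)</code></pre>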