Introduction to Scikit-Learn Pipeline Objects
A scikit-learn Pipeline chains multiple processing steps — such as scaling, feature selection, and classification — into a single object that behaves like a standard estimator, preventing data leakage and simplifying deployment.
Building a Basic Pipeline
A pipeline is constructed from a list of (name, estimator) tuples. Every step except the last must be a transformer that implements both fit() and transform(); the final step only needs to implement fit().
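To make the step contract concrete, here is a minimal sketch of a custom intermediate step. The ColumnClipper class, the step names, and the synthetic data are all hypothetical and invented for illustration; only the fit/transform contract itself comes from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clips each column to the range seen during fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column bounds from the data passed to fit only
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self  # fit returns self so calls can be chained

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.min_, self.max_)


# Tiny synthetic dataset, made up for this example
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_pipe = Pipeline([
    ("clip", ColumnClipper()),      # intermediate step: fit + transform
    ("clf", LogisticRegression()),  # final step: only needs fit
])
demo_pipe.fit(X_demo, y_demo)
demo_preds = demo_pipe.predict(X_demo)
print(demo_preds.shape)
```

Because the final step here also implements predict(), the pipeline as a whole exposes predict(); the pipeline generally offers whatever methods its last step provides.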
Pipeline with Scaling and Classification
<pre><code class="language-python">from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1.0, probability=True))
])
pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))</code></pre>
Using make_pipeline for Convenience
<pre><code class="language-python">from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# make_pipeline auto-names each step from the class name
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_train, y_train)
print(pipe.named_steps)  # {'minmaxscaler': ..., 'logisticregression': ...}</code></pre>
Why Pipelines Prevent Data Leakage
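When you pass a pipeline to cross_val_score, the whole pipeline is cloned and re-fit on each training fold, so every transformer (for example, the scaler) learns its statistics from that fold's training data only. The validation fold is transformed with those fitted statistics but never contributes to them.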
Cross-Validation with a Pipeline
<pre><code class="language-python">from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Pipeline CV Accuracy: {scores.mean():.4f}")
# The scaler is re-fit from scratch on each training fold, so there is no leakage</code></pre>
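One way to check the per-fold refitting claim is to keep the fitted pipelines from each fold and inspect what their scalers learned. The sketch below assumes a StandardScaler rather than the MinMaxScaler used above, because its mean_ attribute is easy to compare. The names cv_pipe, cv_results, and fold_means are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# return_estimator=True keeps the pipeline fitted on each training fold
cv_results = cross_validate(cv_pipe, X, y, cv=5, return_estimator=True)

# Mean of the first feature as learned by each fold's scaler
fold_means = [
    est.named_steps["standardscaler"].mean_[0]
    for est in cv_results["estimator"]
]
print(fold_means)      # one value per fold
print(X[:, 0].mean())  # mean over the full dataset, for comparison
```

The per-fold means differ from one another because each scaler was fitted on a different training subset, which is the behaviour that keeps validation data out of the preprocessing.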
Hyperparameter Tuning Through Pipelines
Pipeline steps expose their parameters with a double-underscore syntax (stepname__param), allowing seamless integration with GridSearchCV.
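The naming scheme can be inspected directly with get_params() and set_params(), which a parameter search also relies on when it applies each candidate setting. This is a small sketch; the variable names param_pipe and svc_keys are illustrative.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(C=1.0)),
])

# Every nested parameter appears under the key "<step name>__<parameter>"
svc_keys = sorted(k for k in param_pipe.get_params() if k.startswith("svc__"))
print(svc_keys)

# set_params accepts the same keys and updates the nested estimators in place
param_pipe.set_params(svc__C=10, scaler__with_mean=False)
print(param_pipe.named_steps["svc"].C)
print(param_pipe.named_steps["scaler"].with_mean)
```

A key whose prefix does not match any step name raises a ValueError, so a quick look at get_params() is a cheap way to validate a parameter grid before launching a search.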
GridSearchCV Over a Pipeline
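<pre><code class="language-python">from sklearn.model_selection import GridSearchCV
# Rebuild the named scaler + SVC pipeline: `pipe` was reassigned by make_pipeline
# above, and that pipeline has no step called "svc"
svc_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC())
])
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["rbf", "linear"]
}
grid = GridSearchCV(svc_pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best score:", grid.best_score_)</code></pre>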
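After the search, the tuned pipeline is typically evaluated on held-out data. The sketch below is self-contained and uses a smaller grid than the one above to keep it quick. The names search, best_pipe, and test_accuracy are illustrative, and it assumes the default refit=True.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("svc", SVC())]),
    {"svc__C": [0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is the whole pipeline
# re-fitted on all of X_train using the best parameters found
best_pipe = search.best_estimator_
test_accuracy = best_pipe.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.4f}")
```

Because best_estimator_ is a full pipeline, the test data passes through the same fitted scaler before reaching the classifier, and no separate preprocessing call is needed.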