Building Custom Transformers in Scikit-Learn
When the built-in transformers don't cover your preprocessing logic, you can write your own by subclassing BaseEstimator and TransformerMixin. A transformer built this way works with Pipeline, GridSearchCV, and the rest of scikit-learn's tooling.
The Transformer API
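A scikit-learn compatible transformer implements two methods itself: fit(X, y=None) and transform(X). Inheriting from TransformerMixin supplies the third, fit_transform(X, y=None), for free. Inheriting from BaseEstimator supplies get_params() and set_params(), provided __init__ stores each argument unchanged on an attribute of the same name.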
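As a rough illustration of what the two base classes contribute, the sketch below defines a hypothetical ScaleBy transformer (the class and its scale parameter are invented for this example) and calls only the methods it inherits.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ScaleBy(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that multiplies X by a constant."""

    def __init__(self, scale=1.0):
        # Store the argument unchanged so get_params()/set_params() work
        self.scale = scale

    def fit(self, X, y=None):
        # Nothing to learn
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) * self.scale


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
t = ScaleBy(scale=2.0)

# fit_transform() is inherited from TransformerMixin
out = t.fit_transform(X_demo)

# get_params() and set_params() are inherited from BaseEstimator
params = t.get_params()
t.set_params(scale=0.5)
out_half = t.transform(X_demo)
```

Note that the class body only defines __init__, fit, and transform; everything else used here comes from the base classes.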
Minimal Custom Transformer
<pre><code class="language-python">import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# Mixin first, BaseEstimator last (the order scikit-learn's own estimators use)
class LogTransformer(TransformerMixin, BaseEstimator):
    """Apply log1p to selected features."""

    def __init__(self, feature_indices=None):
        # Store the argument unchanged so get_params() and clone() work
        self.feature_indices = feature_indices

    def fit(self, X, y=None):
        # Nothing to learn; return self for method chaining
        return self

    def transform(self, X):
        # Float copy: the input is left untouched and integer
        # columns are not truncated on assignment
        X = np.array(X, dtype=float)
        if self.feature_indices is not None:
            X[:, self.feature_indices] = np.log1p(X[:, self.feature_indices])
        else:
            X = np.log1p(X)
        return X


# Usage (X_train is assumed to be an existing 2-D numeric array)
transformer = LogTransformer(feature_indices=[0, 2])
X_transformed = transformer.fit_transform(X_train)</code></pre>
Plugging Into a Pipeline
Because the custom transformer follows the scikit-learn API, it slots directly into Pipeline, and its parameters can be tuned through the step__parameter double-underscore syntax (for example, log__feature_indices for a step named "log").
Full Pipeline with Custom Transformer
<pre><code class="language-python">from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumes LogTransformer from above and an existing train/test split
pipe = Pipeline([
    ("log", LogTransformer(feature_indices=[0, 1])),
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # R^2 on the held-out set

# Tune the transformer's hyperparameters too
param_grid = {
    "log__feature_indices": [[0, 1], [0, 1, 2], None],
    "ridge__alpha": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)</code></pre>
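To make the double-underscore naming concrete, here is a self-contained sketch on synthetic data. The Log1pColumns class is a hypothetical stand-in for a custom transformer, and the data and grid values are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class Log1pColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a custom transformer with one parameter."""

    def __init__(self, feature_indices=None):
        self.feature_indices = feature_indices

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # float copy; input is left untouched
        cols = slice(None) if self.feature_indices is None else self.feature_indices
        X[:, cols] = np.log1p(X[:, cols])
        return X


# Synthetic, non-negative data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=2.0, size=(60, 3))
y_demo = np.log1p(X_demo[:, 0]) + 0.1 * rng.normal(size=60)

pipe = Pipeline([("log", Log1pColumns()), ("ridge", Ridge())])

# "log__feature_indices" addresses parameter feature_indices of the step named "log"
pipe.set_params(log__feature_indices=[0], ridge__alpha=0.5)
step_value = pipe.named_steps["log"].feature_indices
all_param_names = sorted(pipe.get_params().keys())

grid = GridSearchCV(
    pipe,
    {"log__feature_indices": [[0], None], "ridge__alpha": [0.1, 1.0]},
    cv=3,
)
grid.fit(X_demo, y_demo)
best = grid.best_params_  # keys use the same step__parameter names
```

Listing pipe.get_params() is a handy way to discover which step__parameter names a grid can target.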
Quick Functional Transformers with FunctionTransformer
<pre><code class="language-python">import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Wrap a plain function as a transformer (no state needed)
log_transformer = FunctionTransformer(np.log1p, validate=True)

pipe = Pipeline([
    ("log", log_transformer),
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])
pipe.fit(X_train, y_train)</code></pre>
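FunctionTransformer also accepts an inverse function and fixed keyword arguments for the wrapped function. The sketch below shows both on a small made-up array; the variable names and clipping bounds are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Pair log1p with its inverse so inverse_transform() can undo it
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, validate=True)

X_demo = np.array([[0.0, 1.0], [9.0, 99.0]])
X_log = log_tf.fit_transform(X_demo)
X_back = log_tf.inverse_transform(X_log)  # recovers X_demo up to rounding

# kw_args forwards fixed keyword arguments to the wrapped function
clipper = FunctionTransformer(np.clip, kw_args={"a_min": 0.0, "a_max": 10.0})
X_clipped = clipper.fit_transform(X_demo)  # 99.0 is clipped to 10.0
```

This covers stateless transformations only; anything that must learn from the training data still needs a class with its own fit().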
Stateful Transformers
When your transformation requires statistics computed from the training data (e.g., custom imputation or learned scaling), compute them in fit(), store them as attributes whose names end in an underscore (scikit-learn's convention for learned state), and apply them in transform().
Winsorizer Example (Stateful)
<pre><code class="language-python">import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Winsorizer(TransformerMixin, BaseEstimator):
    """Clip each column to its [lower, upper] percentiles learned from training data."""

    def __init__(self, lower=5, upper=95):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Per-column bounds; the trailing underscore marks learned state
        self.lower_ = np.percentile(X, self.lower, axis=0)
        self.upper_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lower_, self.upper_)</code></pre>
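The same fit-then-apply pattern covers the custom-imputation case mentioned earlier. As a minimal sketch, the hypothetical MedianImputer below (not a scikit-learn class; SimpleImputer is the built-in equivalent) learns per-column medians in fit() and uses check_is_fitted to guard transform(). The demo arrays are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class MedianImputer(TransformerMixin, BaseEstimator):
    """Hypothetical example: fill NaNs with per-column training medians."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state: one median per column, ignoring NaNs
        self.medians_ = np.nanmedian(X, axis=0)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit() has not run yet
        check_is_fitted(self, "medians_")
        X = np.array(X, dtype=float)  # float copy; input is left untouched
        rows, cols = np.nonzero(np.isnan(X))
        X[rows, cols] = self.medians_[cols]
        return X


X_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_new_demo = np.array([[np.nan, 100.0], [7.0, np.nan]])

imputer = MedianImputer().fit(X_train_demo)
# Gaps are filled with the training medians (3.0 and 20.0),
# not with statistics of X_new_demo
X_filled = imputer.transform(X_new_demo)

try:
    MedianImputer().transform(X_new_demo)
    raised = False
except NotFittedError:
    raised = True
```

Keeping the statistics tied to fit() is what lets a Pipeline inside cross-validation learn them from each training fold only.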