ColumnTransformer: The Modern Approach
ColumnTransformer applies specified transformers to named column subsets and concatenates the outputs, making it the standard way to handle mixed-type (numeric + categorical) datasets.
Numeric and Categorical Preprocessing
<pre><code class="language-python">from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Sample mixed-type dataset; None becomes NaN in the numeric columns
data = pd.DataFrame({
    "age": [25, 32, None, 45],
    "income": [50000, 80000, 60000, None],
    "gender": ["M", "F", "F", "M"],
    "city": ["NY", "LA", "NY", "SF"]
})
y = [0, 1, 0, 1]

numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]

# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical columns: fill missing values with the mode, then one-hot encode.
# sparse_output requires scikit-learn >= 1.2 (older releases call it sparse).
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Route each column subset to its transformer; unlisted columns are dropped by default
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# Final pipeline: preprocessing followed by the classifier
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])
pipe.fit(data, y)
print(pipe.predict(data))</code></pre>
FeatureUnion: Combining Parallel Transformers
FeatureUnion applies several transformers to the same input and horizontally concatenates the resulting feature arrays, which is useful for combining different feature extraction strategies.
FeatureUnion Example
<pre><code class="language-python">from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Combine PCA components with degree-2 polynomial features;
# both branches receive the same scaled input
union = FeatureUnion([
    ("pca", PCA(n_components=20)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False))
])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("union", union),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42))
])

# Train on the first 1000 samples and score on the remainder
pipe.fit(X[:1000], y[:1000])
print(pipe.score(X[1000:], y[1000:]))</code></pre>
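Because FeatureUnion concatenates horizontally, the width of its output is the sum of the widths produced by each branch. The following sketch checks this on a small random matrix; the data and the names `X_small` and `union_small` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: 50 samples, 4 features
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 4))

union_small = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
Xt = union_small.fit_transform(X_small)

n_pca = 2                # two principal components
n_poly = 4 + 4 * 5 // 2  # 4 linear terms + 10 squares and pairwise products
print(Xt.shape, n_pca + n_poly)
```

The same arithmetic explains why the digits example grows quickly: degree-2 polynomial features on 64 inputs produce far more columns than the 20 PCA components next to them.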
Tuning FeatureUnion Parameters
<pre><code class="language-python">from sklearn.model_selection import GridSearchCV

# Reuses pipe, X, and y from the FeatureUnion example.
# Keys follow the pattern step__substep__parameter.
param_grid = {
    "union__pca__n_components": [10, 20, 30],
    "clf__n_estimators": [50, 100]
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)</code></pre>
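The double-underscore keys in the grid are built from the step names, and one way to discover which keys are available is to list them with `get_params()`. This is a small sketch that rebuilds an equivalent pipeline under the assumed name `demo_pipe` so that it runs on its own.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same structure as the pipeline above, rebuilt so the sketch is self-contained
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("union", FeatureUnion([
        ("pca", PCA(n_components=20)),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ])),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Names accepted by set_params(), and therefore usable as param_grid keys
param_names = sorted(demo_pipe.get_params().keys())

# Narrow the list down to the PCA branch of the union
pca_params = [name for name in param_names if name.startswith("union__pca__")]
print(pca_params)
```

A misspelled key in `param_grid` raises an error when the search is fitted, so checking names against this list first can save a failed run.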
Choosing Between FeatureUnion and ColumnTransformer
Prefer ColumnTransformer for most real-world tabular tasks, because it routes each column subset to its own transformer. Use FeatureUnion when you want to apply several different transformations to the same features and combine the outputs.
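As an illustration of the "same features, several strategies" case, the sketch below extracts word-level and character-level TF-IDF features from the same text and joins them with FeatureUnion. The three documents and the n-gram settings are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical miniature corpus
docs = ["the cat sat", "the dog ran", "a cat ran"]

text_union = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word")),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))),
])
Xt = text_union.fit_transform(docs)  # sparse matrix, one row per document

# Look up the fitted vectorizers by name to compare vocabulary sizes
fitted = dict(text_union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)
print(Xt.shape, n_word, n_char)
```

Both vectorizers see the full list of documents, and the combined matrix has one column per word term plus one per character n-gram.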
Summary Comparison
- ColumnTransformer: Different transformers on different columns → concatenated output
- FeatureUnion: Multiple transformers on the same columns → concatenated output
- Both can be nested inside Pipelines and tuned with GridSearchCV
- ColumnTransformer is generally preferred for tabular data; FeatureUnion is common in NLP feature combination
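The tuning example earlier targets a FeatureUnion, but the same double-underscore addressing appears to work for a ColumnTransformer nested in a Pipeline. The following is a minimal sketch on synthetic data; the frame, the target, and the grid values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): 60 rows, one numeric and one categorical column
rng = np.random.default_rng(0)
n = 60
frame = pd.DataFrame({
    "age": rng.integers(20, 60, size=n).astype(float),
    "city": rng.choice(["NY", "LA", "SF"], size=n),
})
frame.loc[::10, "age"] = np.nan  # inject a few missing values
target = (frame["city"] == "NY").astype(int)

num = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
pre = ColumnTransformer([
    ("num", num, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("preprocessor", pre),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Keys: pipeline step -> ColumnTransformer entry -> inner step -> parameter
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, None],
}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(frame, target)
print(search.best_params_)
```

Because the imputer sits inside the pipeline, each cross-validation fold fits it on that fold's training portion only, which avoids leaking statistics from the held-out rows.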