Feature Unions in Pipelines

Pipelines process steps sequentially, but real-world feature engineering often requires applying different transformations to different feature subsets in parallel. FeatureUnion and ColumnTransformer make this composable: each runs several transformers side by side and concatenates their outputs into a single feature matrix.


ColumnTransformer: The Modern Approach

ColumnTransformer applies specified transformers to named column subsets and concatenates the outputs, making it the standard way to handle mixed-type (numeric + categorical) datasets.

Numeric and Categorical Preprocessing

<pre><code class="language-python">from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Sample mixed-type dataset
data = pd.DataFrame({
    "age": [25, 32, None, 45],
    "income": [50000, 80000, 60000, None],
    "gender": ["M", "F", "F", "M"],
    "city": ["NY", "LA", "NY", "SF"]
})
y = [0, 1, 0, 1]

numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# Final pipeline
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])

pipe.fit(data, y)
print(pipe.predict(data))</code></pre>

FeatureUnion: Combining Parallel Transformers

FeatureUnion applies multiple transformers to the same input and horizontally concatenates the resulting feature arrays. This is useful for combining different feature extraction strategies, such as a low-dimensional projection alongside an expanded set of interaction terms.

FeatureUnion Example

<pre><code class="language-python">from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Combine PCA features with degree-2 polynomial features
union = FeatureUnion([
    ("pca", PCA(n_components=20)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False))
])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("union", union),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42))
])

# Train on the first 1000 samples, evaluate on the rest
pipe.fit(X[:1000], y[:1000])
print(pipe.score(X[1000:], y[1000:]))</code></pre>
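The horizontal concatenation can be checked directly by comparing output widths. The sketch below fits each transformer separately and then the union on a 200-row slice of the digits data; the slice size and the variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures

X, _ = load_digits(return_X_y=True)
X_small = X[:200]  # small slice, chosen only to keep the check fast

# Width of each branch when fitted on its own
pca_width = PCA(n_components=20).fit_transform(X_small).shape[1]
poly_width = PolynomialFeatures(
    degree=2, include_bias=False
).fit_transform(X_small).shape[1]

# Width of the union of the same two transformers
union = FeatureUnion([
    ("pca", PCA(n_components=20)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
union_width = union.fit_transform(X_small).shape[1]

print(pca_width, poly_width, union_width)
```

The union's width is the sum of the two branches. Note that the degree-2 expansion of 64 pixel features produces 2,144 columns, so the polynomial branch heavily outnumbers the 20 PCA components; whether that imbalance matters depends on the downstream model.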

Tuning FeatureUnion Parameters

<pre><code class="language-python">from sklearn.model_selection import GridSearchCV

param_grid = {
    "union__pca__n_components": [10, 20, 30],
    "clf__n_estimators": [50, 100]
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)</code></pre>
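Beyond tuning a transformer's own parameters, a grid search can also test whether a whole branch of the union is worth keeping. FeatureUnion accepts the string "drop" in place of a transformer, so the branch name itself can appear in the grid. The following is a sketch under a few assumptions: it uses a 300-row subset and a smaller forest purely to keep the search quick, and those values are not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_digits(return_X_y=True)
X_small, y_small = X[:300], y[:300]  # subset chosen only for speed

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("union", FeatureUnion([
        ("pca", PCA(n_components=20)),
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ])),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=42)),
])

param_grid = {
    # Either remove the polynomial branch entirely or keep it
    "union__poly": ["drop", PolynomialFeatures(degree=2, include_bias=False)],
    # The PCA branch is always present, so its size can be tuned as usual
    "union__pca__n_components": [10, 20],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_small, y_small)
print(grid.best_params_)
```

Only tune the inner parameters of a branch that is present in every candidate. Combining `"union__pca": ["drop", ...]` with `"union__pca__n_components"` in the same grid would fail for the candidates where the branch has been dropped; a list of separate grids avoids that.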

Choosing Between FeatureUnion and ColumnTransformer

Prefer ColumnTransformer for most real-world tabular tasks, because it routes different column subsets to different transformers. Use FeatureUnion when you want to apply several different strategies to the same features and combine the outputs.
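The two tools are not mutually exclusive. As a rough sketch, a FeatureUnion can serve as one of the transformers inside a ColumnTransformer, so that the numeric columns receive two parallel views while the categorical column is encoded separately. The data frame and the names `frame`, `numeric_views`, and `features` below are hypothetical, made up for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for this sketch
frame = pd.DataFrame({
    "height": [1.60, 1.70, 1.80, 1.75, 1.65],
    "weight": [55.0, 70.0, 82.0, 77.0, 60.0],
    "age": [23.0, 35.0, 41.0, 29.0, 52.0],
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# FeatureUnion: two strategies applied to the SAME numeric columns
numeric_views = FeatureUnion([
    ("scaled", StandardScaler()),  # 3 standardized columns
    ("pca", PCA(n_components=1)),  # 1 principal component
])

# ColumnTransformer: DIFFERENT transformers for different column subsets
preprocessor = ColumnTransformer([
    ("num", numeric_views, ["height", "weight", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

features = preprocessor.fit_transform(frame)
print(features.shape)  # 3 scaled + 1 PCA + 3 one-hot city columns
```

In this sketch the PCA branch sees the unscaled columns, which keeps the example short; in practice it would usually be wrapped in a small Pipeline with a scaler first.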

Summary Comparison

  • ColumnTransformer: Different transformers on different columns → concatenated output
  • FeatureUnion: Multiple transformers on the same columns → concatenated output
  • Both can be nested inside Pipelines and tuned with GridSearchCV
  • ColumnTransformer is generally preferred for tabular data; FeatureUnion is common in NLP feature combination