Handling Categorical Data: One-Hot Encoding
One-hot encoding transforms a single categorical column with k unique values into k binary columns, each indicating the presence or absence of one category. Unlike integer labels, this representation does not imply a false ordinal relationship between categories.
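To make the contrast with integer labels concrete, here is a minimal sketch on a made-up four-row `color` column (the data and variable names are assumptions for illustration only). It shows the integer codes pandas would assign next to the k = 3 binary columns.

```python
import pandas as pd

# Hypothetical toy column with k = 3 unique values
colors = pd.Series(["red", "blue", "green", "red"], name="color")

# Integer codes follow the sorted category order, so they imply
# an arbitrary ranking: blue=0, green=1, red=2
codes = colors.astype("category").cat.codes

# One-hot: one binary column per category, exactly one 1 per row
one_hot = pd.get_dummies(colors, dtype=int)

print(codes.tolist())
print(one_hot)
```

A model reading the integer codes could treat red as "larger" than blue. The one-hot columns carry no such ordering.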
One-Hot Encoding with sklearn and pandas
Both pandas' get_dummies and sklearn's OneHotEncoder produce one-hot representations. The sklearn encoder learns its categories during fit, so it integrates into pipelines and, with handle_unknown="ignore", can handle categories at inference time that were never seen in training.
Using pandas get_dummies
<pre><code class="language-python">import pandas as pd
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
# drop_first=True drops one category per column to avoid the dummy variable trap
# (perfect collinearity among the k columns, which mainly affects linear models)
df_encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(df_encoded)</code></pre>
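One practical caveat with get_dummies is that it only knows the categories present in the frame it is given. The sketch below, using made-up `train` and `test` frames (assumptions for illustration), shows how the column sets can diverge and one common way to realign them with `reindex`.

```python
import pandas as pd

# Hypothetical data: "purple" appears only in the test set
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "purple"]})

train_enc = pd.get_dummies(train, columns=["color"], dtype=int)
test_enc = pd.get_dummies(test, columns=["color"], dtype=int)

# Encoded separately, the two frames end up with different columns
print(list(train_enc.columns))
print(list(test_enc.columns))

# Align test to the training columns: categories missing from test are
# filled with 0, and categories unseen in training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

After alignment the unseen "purple" row is all zeros. This is the same outcome OneHotEncoder gives with handle_unknown="ignore", but here the bookkeeping is manual.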
Using sklearn OneHotEncoder
<pre><code class="language-python">from sklearn.preprocessing import OneHotEncoder
import numpy as np
enc = OneHotEncoder(drop="first", sparse_output=False, handle_unknown="ignore")
X_train_enc = enc.fit_transform(X_train[["city", "color"]])
X_test_enc = enc.transform(X_test[["city", "color"]])
print(enc.get_feature_names_out()) # column names</pre>
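As a rough illustration of the pipeline integration mentioned earlier, the encoder can be wrapped in a ColumnTransformer and chained with an estimator. The data, the `age` column, the labels, and the choice of LogisticRegression are all assumptions made for this sketch, not part of the original example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two categorical columns and one numeric column
X_train = pd.DataFrame({
    "city": ["Paris", "Lima", "Paris", "Oslo"],
    "color": ["red", "blue", "green", "red"],
    "age": [31, 45, 27, 52],
})
y_train = [1, 0, 1, 0]

# One-hot encode the categorical columns and pass the numeric column through
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "color"]),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# "Tokyo" was never seen during fit; its city columns are encoded as all zeros
X_new = pd.DataFrame({"city": ["Tokyo"], "color": ["blue"], "age": [38]})
pred = model.predict(X_new)
print(pred)
```

Because the encoder is fitted inside the pipeline, the same learned categories are applied at prediction time without any manual column alignment.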
High-Cardinality Categories
When a column has hundreds of unique values (e.g., zip codes), standard one-hot encoding produces an extremely wide, sparse matrix. This increases memory use and training time and can hurt model performance, because many of the resulting columns are backed by only a few rows. Alternative strategies include target encoding, frequency encoding, and embedding layers.
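Target encoding, the first alternative listed above, replaces each category with a statistic of the target for that category. The following is a minimal sketch of one common variant, a smoothed mean, on made-up data. The column names, the smoothing strength `m`, and the formula are illustrative assumptions rather than a prescribed method.

```python
import pandas as pd

# Hypothetical training data with a binary target
train = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "94105", "60601"],
    "target": [1, 0, 1, 1, 0, 1],
})

global_mean = train["target"].mean()
stats = train.groupby("zip_code")["target"].agg(["mean", "count"])

# Shrink each category mean toward the global mean; rare categories
# are pulled more strongly. m is an assumed smoothing strength.
m = 2.0
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["zip_code_te"] = train["zip_code"].map(smoothed)
print(train)
```

Since the encoding is derived from the target, computing it on the same rows used for training can leak label information. In practice it is usually computed out of fold or on a separate split.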
Frequency Encoding for High Cardinality
<pre><code class="language-python">import pandas as pd
df = pd.DataFrame({"zip_code": ["10001", "10001", "94105", "60601"]})  # toy data
# Replace each raw category with its relative frequency: compact and often effective
freq_map = df["zip_code"].value_counts(normalize=True)
df["zip_code_freq"] = df["zip_code"].map(freq_map)</code></pre>
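When the mapping is applied to new data, categories that were absent from training have no entry in the frequency map and come back as NaN. The sketch below shows one possible way to handle this, under the assumption that a frequency of 0 is an acceptable fill value. The `train` and `test` frames are made up for illustration.

```python
import pandas as pd

# Hypothetical data: zip code "73301" appears only in the test set
train = pd.DataFrame({"zip_code": ["10001", "10001", "94105", "60601"]})
test = pd.DataFrame({"zip_code": ["10001", "73301"]})

# Learn the frequencies from the training data only
freq_map = train["zip_code"].value_counts(normalize=True)

# Unseen categories map to NaN; fill them with 0.0 (an assumed convention)
test["zip_code_freq"] = test["zip_code"].map(freq_map).fillna(0.0)
print(test)
```

Note that distinct categories with the same frequency become indistinguishable after this encoding, so it works best when frequency itself carries signal.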