Log Transformations for Skewed Data

Many real-world features, such as income, house prices, and website traffic, follow right-skewed distributions. A log transformation compresses the long right tail, which reduces the influence of extreme values and often brings the distribution closer to normal.


Applying Log Transformations

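The natural log np.log and np.log1p (log(1+x), safe for zeros) are the most common choices. np.log requires strictly positive inputs, while np.log1p accepts any value greater than -1. Because a fixed log has no fitted parameters, it can be applied before the train/test split without leaking information. Apply it before scaling, so that the scaler sees the transformed values.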
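The difference between the two functions at zero can be seen in a minimal sketch. The array `values` is a made-up sample, not data from this article, and `np.expm1` is used here only to show that `np.log1p` can be inverted.

```python
import numpy as np

# Hypothetical sample containing a zero (illustrative values only)
values = np.array([0.0, 1.0, 10.0, 1000.0])

# np.log(0) is -inf; silence the divide-by-zero warning for the demo
with np.errstate(divide="ignore"):
    plain_log = np.log(values)

# log1p computes log(1 + x), so zero maps to zero
shifted_log = np.log1p(values)

# expm1 computes exp(x) - 1 and undoes log1p
recovered = np.expm1(shifted_log)

print(plain_log)
print(shifted_log)
print(recovered)
```

Keeping the inverse in mind is useful when the transformed column is a prediction target, since predictions then need to be mapped back to the original units.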

log and log1p in Practice

<pre><code class="language-python">import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# log1p is safe for zero values
df["log_income"] = np.log1p(df["income"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Series.hist does not take a title argument, so set titles on the axes
df["income"].hist(ax=ax1, bins=50)
ax1.set_title("Original")
df["log_income"].hist(ax=ax2, bins=50)
ax2.set_title("Log-Transformed")

plt.tight_layout()
plt.show()</code></pre>

Power Transforms: Box-Cox and Yeo-Johnson

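The Box-Cox and Yeo-Johnson transforms are data-driven power transforms: for each feature they estimate the exponent (lambda) that makes the transformed values as Gaussian as possible. Yeo-Johnson handles zero and negative values; Box-Cox requires strictly positive inputs. Because lambda is learned from the data, fit the transform on the training set only and reuse it on the test set.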

Using PowerTransformer

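<pre><code class="language-python">from sklearn.preprocessing import PowerTransformer

# X_train and X_test are assumed to be DataFrames from an earlier train/test split
pt = PowerTransformer(method="yeo-johnson", standardize=True)

# Fit on the training data only, then reuse the fitted lambdas on the test data
X_train_transformed = pt.fit_transform(X_train[["income", "spend"]])
X_test_transformed = pt.transform(X_test[["income", "spend"]])

print(pt.lambdas_)  # optimal lambda per column</code></pre>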
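To make the positivity requirement concrete, the following sketch fits both methods on synthetic data that contains a single zero. The data, the random seed, and the use of `scipy.stats.skew` to measure skewness are assumptions made for illustration and are not part of the example above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed column with one zero (illustrative data only)
rng = np.random.default_rng(42)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))
X[0, 0] = 0.0

# Box-Cox rejects non-positive values with a ValueError
try:
    PowerTransformer(method="box-cox").fit(X)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False

# Yeo-Johnson accepts the same data
yj = PowerTransformer(method="yeo-johnson", standardize=True)
X_yj = yj.fit_transform(X)

skew_before = skew(X[:, 0])
skew_after = skew(X_yj[:, 0])

# inverse_transform maps the values back to the original scale
X_back = yj.inverse_transform(X_yj)

print("Box-Cox fit succeeded:", box_cox_ok)
print("Skewness before and after:", skew_before, skew_after)
```

If a feature is strictly positive, either method can be used. When zeros or negative values are possible, Yeo-Johnson avoids having to shift the data by hand.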

When Transformation Helps

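Log and power transforms often improve linear and logistic regression models. These models do not require normally distributed inputs, but they do assume a linear relationship between the features and the target (or the log-odds), and a heavily skewed feature lets a few extreme values dominate the fit. Compressing the tail can bring the relationship closer to linear and reduce that leverage. Tree-based models (Random Forests, XGBoost) split on the ordering of feature values, so they are largely insensitive to monotonic transformations of the inputs and typically see little or no benefit.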
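The contrast between the two model families can be checked on a small synthetic example. This is a sketch under stated assumptions: the feature is a geometric grid, the target is constructed to be linear in log(1 + x), and scores and predictions are computed on the training data only, where a monotonic transform leaves the tree's partition of the samples unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic right-skewed feature: 200 values spaced geometrically from 1 to about 22,000
x = np.exp(np.linspace(0.0, 10.0, 200))

# Assumed ground truth: the target is linear in log(1 + x), plus Gaussian noise
y = 3.0 * np.log1p(x) + rng.normal(scale=0.5, size=x.size)

X_raw = x.reshape(-1, 1)
X_log = np.log1p(X_raw)

# Linear regression: training R^2 on the raw versus the log-transformed feature
lin_raw = LinearRegression().fit(X_raw, y).score(X_raw, y)
lin_log = LinearRegression().fit(X_log, y).score(X_log, y)

# Decision tree: the same ordering of values gives the same partition of samples
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_log, y)
same_tree_preds = np.allclose(tree_raw.predict(X_raw), tree_log.predict(X_log))

print("Linear R^2 raw vs log:", lin_raw, lin_log)
print("Tree predictions identical on training data:", same_tree_preds)
```

On unseen points the two trees can disagree slightly, because split thresholds are placed between neighbouring training values and that midpoint differs in raw and log space. Transforming the target rather than the features is a separate matter and can change tree-based models as well.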