Log Transformations for Skewed Data
Many real-world features, such as income, house prices, and website traffic, follow right-skewed distributions. A log transformation compresses the long right tail, which reduces the influence of extreme values and often brings the distribution closer to normal.
Applying Log Transformations
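The natural log (np.log) and np.log1p, which computes log(1 + x) and is therefore safe for zeros, are the most common choices. np.log returns -inf for zero and NaN for negative values, so use it only on strictly positive data. Because the log has no fitted parameters, it can be applied before splitting the data without leaking information. Apply it before scaling so that the scaler sees the transformed values.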
log and log1p in Practice
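The following is a minimal sketch of both functions, assuming synthetic data: the lognormal "income" sample, the small "visits" array, and the skewness helper are illustrative and not taken from any real dataset.

```python
import numpy as np

# Synthetic right-skewed data (an assumption for illustration only)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# np.log requires strictly positive values
log_income = np.log(income)

# Count-like data can contain zeros, where np.log would return -inf
visits = np.array([0, 1, 5, 20, 1000])
log_visits = np.log1p(visits)      # log(1 + x); maps 0 to 0
recovered = np.expm1(log_visits)   # expm1 is the inverse of log1p

print("skewness before:", skewness(income))
print("skewness after: ", skewness(log_income))
```

On data like this, the skewness of the raw values should be strongly positive, while the skewness of the logged values should be close to zero. np.expm1 is useful when predictions made on the log1p scale need to be mapped back to the original units.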
Power Transforms: Box-Cox and Yeo-Johnson
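The Box-Cox and Yeo-Johnson transforms are data-driven power transforms: they estimate, typically by maximum likelihood, the exponent (lambda) that makes a feature as close to Gaussian as possible. Yeo-Johnson handles zero and negative values; Box-Cox requires strictly positive inputs. Because the exponent is learned from the data, unlike a plain log, it should be fitted on the training set only and then reused on the test set.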
Using PowerTransformer
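A short sketch with scikit-learn's PowerTransformer follows. The two-column synthetic matrix and the split sizes are assumptions made for illustration; the point is that the exponents are estimated on the training portion and then applied unchanged to the test portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)

# Hypothetical data: column 0 is strictly positive and skewed,
# column 1 is skewed and includes negative values
X = np.column_stack([
    rng.lognormal(mean=0.0, sigma=1.0, size=2_000),
    rng.exponential(scale=2.0, size=2_000) - 1.0,
])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Yeo-Johnson accepts zeros and negatives; standardize=True also
# rescales each transformed column to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_train_t = pt.fit_transform(X_train)  # lambdas estimated on training data only
X_test_t = pt.transform(X_test)        # fitted lambdas reused on the test data

print("estimated lambdas:", pt.lambdas_)

# Box-Cox only accepts strictly positive input, so use column 0 alone
bc = PowerTransformer(method="box-cox")
pos_train_t = bc.fit_transform(X_train[:, [0]])
```

Passing the second column to the Box-Cox transformer would raise an error because it contains negative values, which is the practical reason to prefer Yeo-Johnson when the sign of the data is not guaranteed.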
When Transformation Helps
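Log and power transforms often improve linear and logistic regression models. These models do not require normally distributed inputs, but they do assume a linear relationship between the features and the response (or the log-odds), and their coefficients are sensitive to extreme values. A transform that linearizes a curved relationship or pulls in a long tail can therefore improve the fit. Tree-based models (Random Forests, XGBoost) split on the ordering of feature values, so they are essentially insensitive to monotonic transformations of the input features and see little or no benefit. Transforming a skewed target variable is a separate matter and can still change the results of a tree-based model.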
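The contrast between linear and tree-based models can be checked with a small experiment. This is a sketch under stated assumptions: the data are synthetic, with a response that is deliberately constructed to depend on log(x), and the tree depth is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumption): y depends linearly on log(x), plus noise
x = np.linspace(1.0, 1000.0, 300)
y = 3.0 * np.log(x) + rng.normal(0.0, 0.3, size=x.size)

X_raw = x.reshape(-1, 1)
X_log = np.log(X_raw)

# Linear model: the log feature matches the true relationship
r2_raw = LinearRegression().fit(X_raw, y).score(X_raw, y)
r2_log = LinearRegression().fit(X_log, y).score(X_log, y)

# Decision tree: splits depend only on the ordering of feature values,
# which a monotonic transform such as log preserves
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_log, y)
same_predictions = np.allclose(tree_raw.predict(X_raw), tree_log.predict(X_log))

print("linear R^2, raw feature:", r2_raw)
print("linear R^2, log feature:", r2_log)
print("tree predictions identical on training points:", same_predictions)
```

The linear model should fit noticeably better on the logged feature, while the two trees should partition the training points identically. Predictions for unseen points that fall between training values can differ slightly, because the split thresholds are placed at midpoints on different scales.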