The Danger of Data Leakage
Data leakage occurs when information from outside the training set, including future data or the target itself, contaminates the model, producing artificially inflated validation scores that collapse in real-world deployment. It is one of the most insidious and common ML mistakes.
Types of Data Leakage
There are two primary forms: target leakage (a feature is derived from the target, or records information that only becomes available after prediction time) and train-test contamination (information from validation or test rows reaches training, for example by fitting preprocessors on the full dataset before splitting).
Target Leakage Example
Imagine predicting whether a customer will churn. Including the feature "number of cancellation calls" is target leakage: those calls are made only by customers who have already decided to churn. At prediction time, this feature does not exist yet for the customers you want to flag.
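One way to make the churn example concrete is to compute each feature only from events dated before the moment of prediction. The sketch below is illustrative only: the dates, the variable names, and the helper count_events_before are hypothetical and do not come from any particular library.

```python
from datetime import date


def count_events_before(event_dates, cutoff):
    """Count only the events dated strictly before the prediction cutoff."""
    return sum(1 for event_date in event_dates if event_date < cutoff)


# Hypothetical data: the day we want to score the customer, and the days
# on which that customer later phoned to cancel.
prediction_date = date(2024, 3, 1)
cancellation_calls = [date(2024, 3, 5), date(2024, 3, 9)]

# Leaky feature: counts every call in the historical record, including
# calls that happened after the prediction date.
leaky_count = len(cancellation_calls)

# Point-in-time feature: counts only what was known on the prediction date.
safe_count = count_events_before(cancellation_calls, prediction_date)

print(leaky_count, safe_count)
```

With these made-up dates the leaky count is non-zero while the point-in-time count is zero, which matches the observation that the feature does not exist yet when the prediction is made.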
Train-Test Contamination
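The contamination pattern named earlier, fitting a preprocessor on the full dataset before splitting, can be sketched as follows. This is a minimal illustration on synthetic data (the array shapes, seeds, and variable names are assumptions chosen for the example), contrasting a scaler fit on all rows with one fit on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Contaminated: the scaler's mean and variance are computed from every row,
# including rows that will later be held out as the test set.
leaky_scaler = StandardScaler().fit(X)

# Clean: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clean_scaler = StandardScaler().fit(X_train)
X_train_scaled = clean_scaler.transform(X_train)
X_test_scaled = clean_scaler.transform(X_test)  # transform only, never fit

# The two scalers learned different statistics; the difference is exactly
# the information that came from the held-out rows.
print(leaky_scaler.mean_)
print(clean_scaler.mean_)
```

For a simple scaler the numerical difference is often small, but the same mistake with imputers, feature selectors, or target encoders can inflate validation scores far more.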
Prevention: Use sklearn Pipelines
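sklearn's Pipeline guards against train-test contamination, provided every preprocessing step lives inside the pipeline and the whole pipeline is passed to the cross-validation routine. Each preprocessing step is then fit (via fit_transform) only on the training folds, and only transform is applied to the validation/test folds. A Pipeline does not protect against target leakage, which has to be caught by reviewing the features themselves.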
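A minimal sketch of this pattern, assuming a synthetic classification dataset and a StandardScaler followed by LogisticRegression (both chosen only for illustration), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Preprocessing and model are bundled so they are always fit together.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline on each training
# fold, so the scaler never sees the corresponding validation fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The key design choice is passing the unfitted pipeline, rather than pre-scaled data, to cross_val_score. Scaling X up front and then cross-validating only the classifier would reintroduce the contamination.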