The Danger of Data Leakage

Data leakage occurs when information from outside the training set, including future data or the target itself, contaminates the model, producing artificially inflated validation scores that collapse in real-world deployment. It is one of the most common and insidious mistakes in machine learning.


Types of Data Leakage

There are two primary forms: target leakage (a feature is derived from the target, or only becomes known after the moment of prediction) and train-test contamination (information from the test set reaches training, for example by fitting preprocessors on the full dataset before splitting).

Target Leakage Example

Imagine predicting whether a customer will churn. Including the feature "number of cancellation calls" is target leakage: only customers who have already decided to churn make those calls. At prediction time, this feature does not exist yet for the customers you want to flag.
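One way to catch this kind of feature is to record when each value becomes known relative to the prediction moment and drop anything that arrives later. The sketch below is only an illustration: the feature names other than the cancellation calls, the offsets, and the helper `split_by_availability` are all hypothetical.

```python
from datetime import timedelta

# Hypothetical feature catalog: how long after the prediction moment each
# feature's value becomes known. Zero or negative means it is known in time.
FEATURE_AVAILABILITY = {
    "monthly_charges": timedelta(days=-30),
    "support_tickets_last_90d": timedelta(days=-1),
    "cancellation_calls": timedelta(days=14),  # recorded only once churn is under way
}


def split_by_availability(availability):
    """Separate features known at prediction time from those that arrive later."""
    safe, leaky = [], []
    for name, offset in availability.items():
        if offset <= timedelta(0):
            safe.append(name)
        else:
            leaky.append(name)
    return safe, leaky


safe_features, leaky_features = split_by_availability(FEATURE_AVAILABILITY)
print("keep:", safe_features)
print("drop:", leaky_features)
```

In practice the offsets would come from knowledge of how the data is collected, not from the data itself, so a check like this complements domain review and does not replace it.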

Train-Test Contamination

<pre><code class="language-python">from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: the scaler sees test data during fit
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fitted on ALL data!
X_train, X_test = train_test_split(X_scaled)

# CORRECT: split first, then fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only</code></pre>
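To make the effect of fitting on the full dataset concrete, here is a minimal sketch using made-up numbers for a single feature. It uses the standard library's `statistics` module as a stand-in for StandardScaler (which also uses the population standard deviation); the helpers `fit_scaler` and `transform` are hypothetical.

```python
import statistics

# Made-up values for one feature; the last two rows play the role of the test set.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = values[:4], values[4:]


def fit_scaler(sample):
    """Return (mean, population std), the statistics a standard scaler learns."""
    return statistics.fmean(sample), statistics.pstdev(sample)


def transform(sample, mean, std):
    """Standardize a sample with statistics that were fitted elsewhere."""
    return [(x - mean) / std for x in sample]


leaky_mean, leaky_std = fit_scaler(values)  # WRONG: test rows shape the statistics
clean_mean, clean_std = fit_scaler(train)   # CORRECT: training rows only

print("mean fitted on all data:  ", leaky_mean)
print("mean fitted on train only:", clean_mean)
print("train scaled with leaky stats:", transform(train, leaky_mean, leaky_std))
print("train scaled with clean stats:", transform(train, clean_mean, clean_std))
```

The two test rows pull the leaky mean far away from the training mean, so the training features the model sees already encode something about the test distribution.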

Prevention: Use sklearn Pipelines

sklearn's Pipeline prevents train-test contamination by design, provided every preprocessing step lives inside the pipeline: during cross-validation, each step is fitted (fit_transform) on the training folds only and merely applied (transform) to the validation or test fold.

Pipeline as a Leakage Shield

<pre><code class="language-python">from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# cross_val_score applies fit_transform only on train folds
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())</code></pre>
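The same protection applies to a single hold-out split. The sketch below, which assumes synthetic data from `make_classification` in place of a real dataset, fits the pipeline on the training rows and then checks that the scaler's learned mean matches the training mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration); use your own X and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only

# The statistics stored in the scaler come from the training rows alone.
scaler_mean = pipe.named_steps["scaler"].mean_
learned_from_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))
print("scaler mean matches training mean:", learned_from_train_only)

# Scoring only transforms the test rows with the already-fitted scaler.
test_accuracy = pipe.score(X_test, y_test)
print("held-out accuracy:", test_accuracy)
```

Because the split happens before `fit`, the test rows never influence the scaler or the classifier. The accuracy printed here depends on the synthetic data and is not meaningful in itself.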