Tracking Data Drift in Production
Data drift occurs when the statistical properties of production inputs diverge from the training distribution. It degrades model accuracy silently: the model keeps returning predictions without raising any errors, so the problem often goes unnoticed until downstream metrics have already suffered.
Types of Drift
There are several distinct flavours of drift, each requiring different detection strategies.
Drift Taxonomy
- Data/Feature Drift (covariate shift): p(X) changes but p(y|X) stays the same — inputs shift but the relationship holds
- Label Drift (prior probability shift): p(y) changes — the target class balance shifts
- Concept Drift: p(y|X) changes — the fundamental relationship between features and labels changes
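- Prediction Drift: p(ŷ) changes, i.e. the distribution of model outputs shifts. It is a useful proxy signal for degradation when ground-truth labels are delayed, although for a fixed model it reflects changes in the inputs rather than directly measuring p(y|X)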
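To make the taxonomy concrete, here is a minimal simulation sketch. The one-dimensional feature, the threshold labelling rule, and the shift sizes are all illustrative assumptions, not taken from any real dataset. It contrasts covariate shift (p(X) moves, the rule is fixed) with concept drift (p(X) is fixed, the rule moves).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def label(x, threshold=0.0):
    """Deterministic labelling rule standing in for p(y|X) (illustrative only)."""
    return (x > threshold).astype(int)

# Reference data: X ~ N(0, 1), y = 1 when x > 0
x_ref = rng.normal(0.0, 1.0, n)
y_ref = label(x_ref)

# Covariate shift: p(X) moves, the labelling rule is unchanged
x_cov = rng.normal(1.0, 1.0, n)
y_cov = label(x_cov)

# Concept drift: p(X) is unchanged, the labelling rule moves
x_con = rng.normal(0.0, 1.0, n)
y_con = label(x_con, threshold=1.0)

for name, x, y in [("reference", x_ref, y_ref),
                   ("covariate shift", x_cov, y_cov),
                   ("concept drift", x_con, y_con)]:
    print(f"{name:16s} mean(X)={x.mean():+.2f}  P(y=1)={y.mean():.2f}")
```

In this toy setup the class balance p(y) changes in both scenarios, which is why label drift alone cannot tell you which kind of drift is underway. Concept drift leaves the input statistics untouched, so input-only monitoring would miss it.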
Statistical Tests for Drift Detection
Common drift detection methods compare reference (training) and production (current) distributions using statistical tests.
Kolmogorov-Smirnov and Chi-Squared Tests
<pre><code class="language-python">import numpy as np
from scipy import stats

# Simulate reference and production samples (seeded for reproducibility)
rng = np.random.default_rng(42)
ref = rng.normal(0, 1, 1000)      # training distribution
prod = rng.normal(0.5, 1.2, 500)  # shifted production data

# Two-sample K-S test for a continuous feature
ks_stat, p_value = stats.ks_2samp(ref, prod)
print(f"KS Statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Drift detected!")
else:
    print("No significant drift.")

# Chi-squared test for a categorical feature:
# rows are reference/production, columns are per-category counts
ref_counts = np.array([600, 300, 100])
prod_counts = np.array([400, 350, 250])
chi2, p, dof, expected = stats.chi2_contingency([ref_counts, prod_counts])
print(f"Chi2 p-value: {p:.4f}")</code></pre>
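In practice these two tests are usually applied column by column. The sketch below is one possible way to do that: the helper name `detect_drift`, the column names, and the 0.05 threshold are assumptions made for illustration. It runs a K-S test on numeric columns and a chi-squared test on everything else.

```python
import numpy as np
import pandas as pd
from scipy import stats

def detect_drift(reference: pd.DataFrame, current: pd.DataFrame,
                 alpha: float = 0.05) -> pd.DataFrame:
    """Run a K-S test on numeric columns and a chi-squared test on the rest."""
    rows = []
    for col in reference.columns:
        ref, cur = reference[col].dropna(), current[col].dropna()
        if pd.api.types.is_numeric_dtype(ref):
            test = "ks"
            p_value = stats.ks_2samp(ref, cur).pvalue
        else:
            test = "chi2"
            # Build a 2 x k table of counts over the union of observed categories
            categories = sorted(set(ref) | set(cur))
            table = np.array([
                ref.value_counts().reindex(categories, fill_value=0).to_numpy(),
                cur.value_counts().reindex(categories, fill_value=0).to_numpy(),
            ])
            _, p_value, _, _ = stats.chi2_contingency(table)
        rows.append({"feature": col, "test": test,
                     "p_value": p_value, "drift": p_value < alpha})
    return pd.DataFrame(rows)

# Hypothetical data: one numeric and one categorical feature, both shifted
rng = np.random.default_rng(1)
reference = pd.DataFrame({
    "amount": rng.normal(0, 1, 1000),
    "channel": rng.choice(["web", "app", "store"], 1000, p=[0.6, 0.3, 0.1]),
})
current = pd.DataFrame({
    "amount": rng.normal(0.5, 1.2, 500),
    "channel": rng.choice(["web", "app", "store"], 500, p=[0.4, 0.35, 0.25]),
})
print(detect_drift(reference, current))
```

When many features are tested at the same threshold, some will be flagged by chance alone, so a stricter per-feature threshold or a multiple-testing correction may be worth considering.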
Monitoring with Evidently AI
<pre><code class="language-python"># Note: these import paths match older Evidently releases; newer releases
# reorganized the package, so check the docs for your installed version.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_df: your training data
# production_df: a recent window of production data
reference_df = pd.read_csv("train_data.csv")
production_df = pd.read_csv("production_window.csv")

# Compare the production window against the reference data
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
# Open drift_report.html for a full interactive drift analysis</code></pre>
Responding to Drift
Detecting drift is only half the solution — you also need a playbook for what happens next.
Drift Response Strategies
- Alert: Send a notification to the model owner for investigation
- Retrain: Trigger a new training run on recent data if performance has degraded
- Fallback: Route traffic to a simpler, more robust model
- Feature engineering review: Investigate whether upstream data pipelines have changed
- Window-based retraining: Automatically retrain on a rolling window of recent data on a schedule
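The strategies above can be wired into a simple decision rule. The following is only a sketch under stated assumptions: the `DriftStatus` fields, the action names, and every threshold are hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriftStatus:
    drifted_share: float            # fraction of monitored features flagged as drifted
    accuracy_drop: Optional[float]  # drop vs. baseline accuracy; None while labels are delayed

def choose_actions(status: DriftStatus,
                   alert_share: float = 0.1,
                   fallback_share: float = 0.5,
                   retrain_drop: float = 0.05) -> list[str]:
    """Map monitoring signals to the response strategies listed above."""
    actions = []
    if status.drifted_share >= alert_share:
        # Notify the owner and check upstream pipelines before anything else
        actions += ["alert", "review_pipelines"]
    if status.accuracy_drop is not None and status.accuracy_drop >= retrain_drop:
        # Retrain only when measured performance has actually degraded
        actions.append("retrain")
    if status.drifted_share >= fallback_share:
        # Widespread drift: route traffic to the simpler model in the meantime
        actions.append("fallback")
    return actions

print(choose_actions(DriftStatus(drifted_share=0.6, accuracy_drop=0.1)))
```

Keeping retraining gated on a measured accuracy drop, rather than on drift alone, avoids retraining on data that has shifted harmlessly. Scheduled window-based retraining would run independently of this rule.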