A/B Testing Deployed Models
A/B testing exposes different user segments to different model versions simultaneously, collecting outcome data to determine — with statistical confidence — whether the new model genuinely improves business metrics.
A/B Test Design
A valid A/B test requires a clear hypothesis, a single primary metric, a sample size fixed in advance by power analysis, and random assignment of users to groups.
Traffic Splitting Strategy
<pre><code class="language-python">import hashlib


def assign_variant(user_id: str, traffic_pct: float = 0.2) -> str:
    """
    Deterministically assign a user to A (control) or B (treatment).
    traffic_pct: fraction of users in the B group.
    """
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = (hash_val % 100) / 100.0
    return "B" if bucket < traffic_pct else "A"


# Usage
for uid in ["user_001", "user_002", "user_003"]:
    print(uid, "->", assign_variant(uid, traffic_pct=0.1))</code></pre>
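A hash-based split like this can be sanity-checked empirically. The sketch below is a hypothetical variant, `assign_variant_salted`, that mixes an experiment name into the hash. The salt and the finer bucket count are assumptions for illustration, not part of the original function. The helper `observed_b_share` is also hypothetical; it measures how close the realised share of B is to the requested fraction.

```python
import hashlib


def assign_variant_salted(user_id: str, experiment: str, traffic_pct: float = 0.2) -> str:
    """Hypothetical variant of assign_variant that salts the hash with an experiment name."""
    key = f"{experiment}:{user_id}".encode()
    hash_val = int(hashlib.md5(key).hexdigest(), 16)
    # 10,000 buckets allow traffic fractions as fine as 0.01%
    bucket = (hash_val % 10_000) / 10_000.0
    return "B" if bucket < traffic_pct else "A"


def observed_b_share(experiment: str, traffic_pct: float, n_users: int = 20_000) -> float:
    """Fraction of synthetic user IDs that land in group B."""
    in_b = sum(
        assign_variant_salted(f"user_{i}", experiment, traffic_pct) == "B"
        for i in range(n_users)
    )
    return in_b / n_users


if __name__ == "__main__":
    share = observed_b_share("ranker_v2_test", traffic_pct=0.1)
    print(f"Observed share of B: {share:.3f}")
```

Salting by experiment name means a user who falls into B in one test is not automatically in B in every other test, which would otherwise correlate the experiments.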
Power Analysis: Minimum Sample Size
<pre><code class="language-python">import math
from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()

# Minimum samples per group to detect an absolute lift of 0.02 in a metric
# whose standard deviation is 0.10, with 80% power at alpha=0.05 (one-sided)
n = analysis.solve_power(
    effect_size=0.02 / 0.10,  # lift / baseline_std (Cohen's d)
    alpha=0.05,
    power=0.80,
    alternative="larger",
)
# Round up: truncating would leave the test slightly underpowered
print(f"Required samples per group: {math.ceil(n):,}")</code></pre>
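For a binary metric such as conversion, the standard deviation follows from the rate itself, so the sample size can be sketched directly from the two proportions. The function below is a minimal illustration using the usual normal-approximation formula for a one-sided test. The name `sample_size_two_proportions` and the 10% and 12% rates are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_control: float, p_treatment: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a one-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile matching the desired power
    # Sum of the Bernoulli variances of the two groups
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    n = (z_alpha + z_beta) ** 2 * variance_sum / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Illustrative numbers: 10% baseline conversion, 12% hoped-for conversion
    n = sample_size_two_proportions(0.10, 0.12)
    print(f"Required samples per group: {n:,}")
```

With these illustrative rates the requirement is roughly three thousand users per group. Halving the detectable difference roughly quadruples it, because the difference enters the formula squared.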
Analysing A/B Test Results
After collecting enough data, use a two-sample proportion z-test (for binary metrics) or t-test (for continuous metrics) to determine statistical significance.
Two-Proportion Z-Test
<pre><code class="language-python">from statsmodels.stats.proportion import proportions_ztest
import numpy as np

# Clicks / conversions observed in each group
conversions = np.array([520, 580])  # A: 520, B: 580
nobs = np.array([5000, 5000])       # 5000 users per group

# alternative="smaller" tests H1: rate_A < rate_B, i.e. B converts better than A
stat, p_value = proportions_ztest(conversions, nobs, alternative="smaller")
print(f"Z-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Model B is significantly better: promote!")
else:
    print("No significant improvement: keep Model A.")</code></pre>
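To make the test less of a black box, the same statistic can be computed by hand. The sketch below assumes the pooled-variance form of the two-proportion z-test, which is what `proportions_ztest` uses by default, and relies only on the standard library. The function name `two_proportion_z` is hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test for H1: rate_A < rate_B.

    Returns (z statistic, one-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 both groups share one rate, estimated from the pooled data
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = NormalDist().cdf(z)  # lower tail, matching alternative="smaller"
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z(520, 5000, 580, 5000)
    print(f"Z-statistic: {z:.4f}")
    print(f"p-value: {p:.4f}")
```

For the counts used above this gives a z-statistic of about -1.92 and a one-sided p-value of about 0.028, so the result clears the 0.05 threshold but not by a wide margin.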
Confidence Intervals for the Lift
<pre><code class="language-python">from statsmodels.stats.proportion import proportion_confint

cr_a = 520 / 5000
cr_b = 580 / 5000

# Per-group 95% intervals (proportion_confint defaults to the normal approximation)
ci_a = proportion_confint(520, 5000, alpha=0.05)
ci_b = proportion_confint(580, 5000, alpha=0.05)

print(f"Model A conversion rate: {cr_a:.2%} (95% CI: {ci_a[0]:.2%} - {ci_a[1]:.2%})")
print(f"Model B conversion rate: {cr_b:.2%} (95% CI: {ci_b[0]:.2%} - {ci_b[1]:.2%})")
print(f"Relative lift: {(cr_b - cr_a) / cr_a:.2%}")</code></pre>
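The intervals above describe each conversion rate separately, and overlapping per-group intervals do not by themselves settle whether the difference is real. A complementary sketch is an interval for the difference itself. The function below assumes a simple Wald (unpooled normal approximation) interval for the absolute lift; the name `diff_confint` is hypothetical and other interval methods exist.

```python
import math
from statistics import NormalDist


def diff_confint(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Two-sided Wald confidence interval for rate_B - rate_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    low, high = diff_confint(520, 5000, 580, 5000)
    print(f"Absolute lift: {580 / 5000 - 520 / 5000:.2%} (95% CI: {low:.2%} to {high:.2%})")
```

With the example counts the lower end of this two-sided interval falls just below zero, even though the one-sided z-test on the same data gives p < 0.05. The two procedures answer slightly different questions, so it helps to decide before the experiment which one drives the launch decision.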
Common A/B Testing Pitfalls
Poorly designed A/B tests can produce misleading conclusions, which can be as damaging as running no test at all.
Pitfalls to Avoid
- Peeking: Stopping early when results look significant inflates false positive rates — wait for the pre-determined sample size
- Multiple metrics: Testing many metrics simultaneously requires a Bonferroni or Benjamini-Hochberg correction
- Novelty effect: Users may engage more with anything new; run tests long enough for the effect to stabilise
- Interference: If user A's model affects user B's experience (social networks, pricing), standard tests are invalid — use cluster randomisation instead
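The multiple-metrics correction mentioned in the list can be sketched in a few lines. The functions below are minimal standard-library illustrations of the Bonferroni and Benjamini-Hochberg procedures; the function names and the p-values are hypothetical and chosen only to show how the two differ.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below k
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for five metrics measured in one experiment
    p_values = [0.041, 0.004, 0.200, 0.019, 0.028]
    print("Bonferroni:        ", bonferroni(p_values))
    print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these illustrative values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps three. Bonferroni is the more conservative choice when any false positive is costly; Benjamini-Hochberg trades some of that strictness for power when many metrics are tracked.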