A/B Testing Deployed Models

A/B testing exposes different user segments to different model versions simultaneously, collecting outcome data to determine — with statistical confidence — whether the new model genuinely improves business metrics.


A/B Test Design

A valid A/B test requires a clear hypothesis, a single primary metric, a sample size fixed in advance by a power analysis, and random assignment of users to groups.

Traffic Splitting Strategy

<pre><code class="language-python">import hashlib


def assign_variant(user_id: str, traffic_pct: float = 0.2) -> str:
    """
    Deterministically assign a user to A (control) or B (treatment).

    traffic_pct: fraction of users in the B group.
    """
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = (hash_val % 100) / 100.0
    return "B" if bucket < traffic_pct else "A"


# Usage
for uid in ["user_001", "user_002", "user_003"]:
    print(uid, "->", assign_variant(uid, traffic_pct=0.1))</code></pre>
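Hashing the user ID alone puts the same users in B for every experiment, which can correlate results across tests. A common remedy is to mix an experiment-specific salt into the hash. The sketch below is one way to do that; the `experiment_id` parameter, the `assign_variant_salted` name, and the choice of 10,000 buckets are illustrative assumptions, not part of the function above.

```python
import hashlib


def assign_variant_salted(user_id: str, experiment_id: str, traffic_pct: float = 0.2) -> str:
    """Assign a user to A or B, independently per experiment.

    experiment_id is a hypothetical salt (an assumption for this sketch);
    changing it reshuffles users across buckets.
    """
    key = f"{experiment_id}:{user_id}".encode()
    hash_val = int(hashlib.sha256(key).hexdigest(), 16)
    bucket = (hash_val % 10_000) / 10_000.0  # 0.01% granularity
    return "B" if bucket < traffic_pct else "A"


# Sanity check: the observed share of B should be close to traffic_pct
users = [f"user_{i:05d}" for i in range(20_000)]
share_b = sum(assign_variant_salted(u, "ranker_v2_test", 0.1) == "B" for u in users) / len(users)
print(f"Share of users in B: {share_b:.3f}")
```

Checking the observed split against the intended one before analysing outcomes is a cheap way to catch assignment bugs.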

Power Analysis: Minimum Sample Size

<pre><code class="language-python">import math

from statsmodels.stats.power import NormalIndPower

analysis = NormalIndPower()

# Minimum samples per group to detect a 2% (0.02) lift, given a baseline
# standard deviation of 0.10, with 80% power at alpha=0.05 (one-sided)
n = analysis.solve_power(
    effect_size=0.02 / 0.10,  # lift / baseline_std (Cohen's d)
    alpha=0.05,
    power=0.80,
    alternative="larger",
)

# Round up: truncating would leave the test slightly under-powered
print(f"Required samples per group: {math.ceil(n):,}")</code></pre>
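For a binary metric such as conversion, the effect size is usually stated as two rates rather than as Cohen's d. The sketch below applies the standard normal-approximation formula for a one-sided two-proportion test with equal group sizes, using only the standard library. The function name and the 10% and 12% rates are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_base: float, p_target: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a one-sided two-proportion z-test
    (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    n = (z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2
    return math.ceil(n)


# Hypothetical rates: 10% baseline, hoping to detect a rise to 12%
n_required = sample_size_two_proportions(0.10, 0.12)
print(f"Required samples per group: {n_required:,}")
```

Because the required sample size scales with the inverse square of the difference, halving the detectable lift roughly quadruples the number of users needed.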

Analysing A/B Test Results

After collecting enough data, use a two-sample proportion z-test (for binary metrics) or t-test (for continuous metrics) to determine statistical significance.
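For a continuous metric such as revenue per user, a rough version of the comparison can be sketched with the standard library alone. The example below computes Welch's unequal-variance statistic but takes the p-value from a normal distribution instead of a t distribution, which is only reasonable for large groups; for small samples a proper t-test (for example `scipy.stats.ttest_ind` with `equal_var=False`) is the safer choice. The function name and the synthetic revenue data are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_z_test(a, b):
    """One-sided test of H1: mean(b) > mean(a), using Welch's unequal-variance
    statistic with a normal approximation (large samples only)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    stat = (mean(b) - mean(a)) / se
    p_value = 1 - NormalDist().cdf(stat)
    return stat, p_value


# Synthetic data for illustration only
rng = random.Random(42)
revenue_a = [rng.gauss(20.0, 5.0) for _ in range(5000)]
revenue_b = [rng.gauss(20.5, 5.0) for _ in range(5000)]

stat, p_value = welch_z_test(revenue_a, revenue_b)
print(f"Statistic: {stat:.4f}")
print(f"p-value:   {p_value:.4f}")
```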

Two-Proportion Z-Test

<pre><code class="language-python">import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Clicks / conversions observed in each group
conversions = np.array([520, 580])  # A: 520, B: 580
nobs = np.array([5000, 5000])       # 5000 users per group

# alternative="smaller" tests H1: rate(A) < rate(B), i.e. B converts better
stat, p_value = proportions_ztest(conversions, nobs, alternative="smaller")

print(f"Z-statistic: {stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Model B is significantly better: promote!")
else:
    print("No significant difference: keep Model A.")</code></pre>

Confidence Intervals for the Lift

<pre><code class="language-python">from statsmodels.stats.proportion import proportion_confint

cr_a = 520 / 5000
cr_b = 580 / 5000

# Per-group 95% intervals (normal approximation, the default method)
ci_a = proportion_confint(520, 5000, alpha=0.05)
ci_b = proportion_confint(580, 5000, alpha=0.05)

print(f"Model A conversion rate: {cr_a:.2%} (95% CI: {ci_a[0]:.2%} - {ci_a[1]:.2%})")
print(f"Model B conversion rate: {cr_b:.2%} (95% CI: {ci_b[0]:.2%} - {ci_b[1]:.2%})")

# Point estimate only; this is not an interval for the lift itself
print(f"Relative lift: {(cr_b - cr_a) / cr_a:.2%}")</code></pre>
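The per-group intervals above do not directly give an interval for the lift. One simple option is a Wald interval for the difference in rates, sketched below with the standard library. The function name is hypothetical, and the Wald form is only one of several possible choices (it assumes large samples and rates not too close to 0 or 1).

```python
import math
from statistics import NormalDist


def diff_confint(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Two-sided Wald interval for the absolute lift p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


diff, low, high = diff_confint(520, 5000, 580, 5000)
print(f"Absolute lift: {diff:.2%} (95% CI: {low:.2%} to {high:.2%})")
```

With these counts the two-sided interval has a lower bound just below zero, even though a one-sided test at alpha = 0.05 on the same data rejects the null. The two results are not contradictory: a two-sided 95% interval corresponds to a one-sided test at alpha = 0.025, so decide which framing applies before looking at the data.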

Common A/B Testing Pitfalls

Poorly designed A/B tests can produce misleading conclusions, which can be as damaging as running no test at all.

Pitfalls to Avoid

  • Peeking: Stopping early when results look significant inflates false positive rates — wait for the pre-determined sample size
  • Multiple metrics: Testing many metrics simultaneously requires a Bonferroni or Benjamini-Hochberg correction
  • Novelty effect: Users may engage more with anything new; run tests long enough for the effect to stabilise
  • Interference: If user A's model affects user B's experience (social networks, pricing), standard tests are invalid — use cluster randomisation instead
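The two corrections named in the multiple-metrics pitfall can be sketched in a few lines. The function names and the four p-values below are hypothetical; in practice a library routine such as `statsmodels.stats.multitest.multipletests` covers both methods.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for p-values below alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H0 using the Benjamini-Hochberg step-up rule (controls FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values for four metrics measured in one experiment
p_values = [0.003, 0.020, 0.030, 0.400]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two: here it keeps only the smallest p-value, while Benjamini-Hochberg keeps the first three at the cost of tolerating a controlled share of false discoveries.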