Shadow Mode Deployment

Shadow mode deployment runs a candidate model alongside the current production model on live traffic. The production model's response is served to users, while the candidate's predictions are only logged and compared, so a faulty candidate cannot change what users see.

How Shadow Mode Works

Every incoming request is sent to both models. The primary model's prediction is returned to the user. The shadow model's prediction is captured asynchronously for analysis and has no effect on the response.

Shadow Mode Architecture

Traffic Mirroring Layer: A proxy (nginx, Envoy, or application-level) duplicates each request
Primary Model: Handles the request normally and returns a response
Shadow Model: Receives the mirrored request, runs inference asynchronously
Logging Store: Captures shadow predictions, latency, and errors for offline comparison

Application-Level Shadow in Python

<pre><code class="language-python">import logging
from concurrent.futures import ThreadPoolExecutor

import joblib
import numpy as np

primary_model = joblib.load("model_v1.joblib")
shadow_model = joblib.load("model_v2.joblib")

# Small worker pool so shadow inference runs off the request path.
# (A thread pool is used because predict() is synchronous: scheduling a
# task on an event loop that is not running would never execute it.)
shadow_executor = ThreadPoolExecutor(max_workers=2)


def shadow_predict(features: list) -> None:
    """Run the shadow model, log the result, never affect the response."""
    try:
        X = np.array(features).reshape(1, -1)
        shadow_pred = shadow_model.predict(X)[0]
        logging.info("SHADOW|features=%s|prediction=%s", features, shadow_pred)
    except Exception as e:
        logging.error("Shadow model error: %s", e)


def predict(features: list) -> int:
    X = np.array(features).reshape(1, -1)
    primary_pred = int(primary_model.predict(X)[0])
    # Fire-and-forget shadow inference on a background thread
    shadow_executor.submit(shadow_predict, features)
    return primary_pred  # user only sees primary result</code></pre>
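The Logging Store in the architecture above is described as capturing latency and errors as well as predictions. One way to do that is to wrap the shadow call in a small timing helper that emits one structured record per request. The following is a minimal sketch, not a definitive implementation: `run_shadow`, the `request_id` field, and the JSON record layout are all hypothetical names chosen for illustration.

```python
import json
import logging
import time
from typing import Any, Callable, Sequence

logger = logging.getLogger("shadow")


def run_shadow(predict_fn: Callable[[Sequence[float]], Any],
               features: Sequence[float],
               request_id: str) -> dict:
    """Call the shadow model, time it, and return one structured log record.

    The record layout (request_id, prediction, latency_ms, error) is an
    assumption for this sketch, not a standard format.
    """
    record = {"request_id": request_id, "prediction": None,
              "latency_ms": None, "error": None}
    start = time.perf_counter()
    try:
        record["prediction"] = predict_fn(features)
    except Exception as exc:  # a shadow failure must never reach the caller
        record["error"] = repr(exc)
    record["latency_ms"] = (time.perf_counter() - start) * 1000.0
    # default=str keeps non-JSON types (e.g. NumPy scalars) loggable
    logger.info(json.dumps(record, default=str))
    return record


# Usage with stand-in callables in place of a real model
ok = run_shadow(lambda f: int(sum(f) > 1.0), [0.4, 0.9], "req-1")
failed = run_shadow(lambda f: 1 / 0, [0.4, 0.9], "req-2")
```

Keeping the request identifier in each record makes it possible to join shadow records with primary predictions and, later, with ground-truth labels.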

Analysing Shadow Results

Collect shadow logs for several days or weeks. Start by comparing shadow and primary predictions directly; once ground-truth labels arrive, compare accuracy, recall, and error distributions.

Prediction Agreement Analysis

<pre><code class="language-python">import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score

logs = pd.read_csv("shadow_logs.csv")
# columns: primary_pred, shadow_pred, true_label

# Prediction agreement
agreement = (logs["primary_pred"] == logs["shadow_pred"]).mean()
print(f"Prediction agreement: {agreement:.2%}")

# Cohen's Kappa (agreement beyond chance)
kappa = cohen_kappa_score(logs["primary_pred"], logs["shadow_pred"])
print(f"Cohen Kappa: {kappa:.4f}")

# Accuracy comparison (only rows whose ground-truth label has arrived)
labelled = logs.dropna(subset=["true_label"])
print("Primary accuracy:", accuracy_score(labelled["true_label"], labelled["primary_pred"]))
print("Shadow accuracy: ", accuracy_score(labelled["true_label"], labelled["shadow_pred"]))</code></pre>
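Overall agreement and accuracy can hide where the two models differ. A rough sketch of two follow-up checks is given here: counting which (primary, shadow) prediction pairs disagree, and computing recall per class for each model. It is kept to the standard library so it runs without extra dependencies. The helper names and the toy lists are illustrative assumptions, not real results.

```python
from collections import Counter, defaultdict


def disagreement_counts(primary, shadow):
    """Count (primary_pred, shadow_pred) pairs where the two models differ."""
    return Counter((p, s) for p, s in zip(primary, shadow) if p != s)


def recall_per_class(y_true, y_pred):
    """Recall per class: correctly predicted rows / rows with that true label."""
    hits = defaultdict(int)
    totals = Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    return {label: hits[label] / n for label, n in totals.items()}


# Toy data standing in for labelled shadow logs (made up for illustration)
true_label = [1, 0, 1, 1, 0, 1]
primary_pred = [1, 0, 0, 1, 0, 0]
shadow_pred = [1, 0, 1, 1, 1, 0]

print("Disagreements:", disagreement_counts(primary_pred, shadow_pred))
print("Primary recall:", recall_per_class(true_label, primary_pred))
print("Shadow recall: ", recall_per_class(true_label, shadow_pred))
```

A shadow model can improve recall on one class while losing it on another, so per-class numbers are worth reading alongside the headline accuracy.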

Promoting from Shadow to Production

Shadow deployment de-risks promotion: proceed only when the shadow model's quality metrics are equal to or better than the primary's, with acceptable latency and error rates.

Promotion Checklist

Shadow model accuracy ≥ primary model on held-out ground truth
P95 latency within acceptable SLA (e.g., <200ms)
Error rate <0.1% over a statistically significant request volume
Data drift metrics stable between shadow input and training distribution
Rollback plan documented and tested before full cutover
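The measurable items in the checklist above can be turned into an automated gate. The sketch below assumes the shadow statistics have already been aggregated into a small record. `ShadowStats`, `promotion_failures`, and the `min_requests=10_000` default are hypothetical, and the latency and error thresholds simply mirror the example figures in the checklist. Drift monitoring and the rollback plan are not covered and still need their own checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShadowStats:
    shadow_accuracy: float
    primary_accuracy: float
    latencies_ms: List[float]  # per-request shadow latencies
    n_requests: int
    n_errors: int


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def promotion_failures(stats: ShadowStats,
                       max_p95_ms: float = 200.0,
                       max_error_rate: float = 0.001,
                       min_requests: int = 10_000) -> List[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if stats.n_requests < min_requests:
        failures.append("not enough shadow requests observed")
    if stats.shadow_accuracy < stats.primary_accuracy:
        failures.append("shadow accuracy below primary")
    if not stats.latencies_ms or p95(stats.latencies_ms) >= max_p95_ms:
        failures.append("P95 latency outside SLA")
    if stats.n_requests == 0 or stats.n_errors / stats.n_requests >= max_error_rate:
        failures.append("error rate too high")
    return failures


# Usage with made-up numbers
stats = ShadowStats(shadow_accuracy=0.93, primary_accuracy=0.91,
                    latencies_ms=[50.0] * 95 + [300.0] * 5,
                    n_requests=20_000, n_errors=5)
print(promotion_failures(stats) or "all automated checks passed")
```

Returning the list of failed checks, rather than a single boolean, makes it easier to see why a candidate was held back.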