Shadow Mode Deployment
Shadow mode deployment runs a candidate model alongside the current production model on live traffic. The production model's response is served to users, while the candidate's predictions are only logged and compared, so a faulty candidate cannot change what users see.
How Shadow Mode Works
Every incoming request is sent to both models. The primary model's prediction is returned to the user. The shadow model's prediction is captured asynchronously for analysis, off the request's critical path, so it does not change the response the user receives.
Shadow Mode Architecture
- Traffic Mirroring Layer: A proxy such as nginx or Envoy, or application-level code, duplicates each request
- Primary Model: Handles the request normally and returns a response
- Shadow Model: Receives the mirrored request, runs inference asynchronously
- Logging Store: Captures shadow predictions, latency, and errors for offline comparison
Application-Level Shadow in Python
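The following is a minimal sketch of application-level shadowing, not a production implementation. It assumes both models are plain Python callables that take a feature payload, and the names `predict_with_shadow`, `_run_shadow`, and the log record fields are hypothetical. For simplicity the shadow call is submitted after the primary prediction is computed, so both predictions can be logged in one record; the shadow work still runs on a background thread and never blocks the response.

```python
import json
import logging
import time
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")

# Small, bounded pool so shadow inference cannot starve the primary path
# (the size of 4 is an arbitrary assumption).
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def _run_shadow(shadow_model, request_id, features, primary_prediction):
    """Run the shadow model and log the outcome; exceptions are captured, not raised."""
    record = {"request_id": request_id, "primary": primary_prediction}
    start = time.perf_counter()
    try:
        record["shadow"] = shadow_model(features)
        record["error"] = None
    except Exception as exc:  # a shadow failure must never reach the user
        record["shadow"] = None
        record["error"] = repr(exc)
    record["shadow_latency_ms"] = (time.perf_counter() - start) * 1000.0
    logger.info(json.dumps(record, default=str))
    return record


def predict_with_shadow(primary_model, shadow_model, request_id, features):
    """Return the primary prediction; run the shadow model in the background."""
    prediction = primary_model(features)
    future = _shadow_pool.submit(
        _run_shadow, shadow_model, request_id, features, prediction
    )
    # The caller serves `prediction` immediately; `future` is returned only
    # so that tests or batch jobs can wait on the shadow record if needed.
    return prediction, future
```

In a web handler, only `prediction` would be returned to the client and the future would be ignored. A real deployment would typically write the records to a durable logging store rather than the standard logger used here.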
Analysing Shadow Results
Collect shadow logs for several days or weeks. First compare shadow and primary predictions directly. Then, once ground truth labels arrive, compare accuracy, recall, and error distributions for the two models.
Prediction Agreement Analysis
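As a rough illustration, agreement between the two models can be computed directly from the shadow logs before any labels are available. The sketch below assumes each log record is a dict with `primary` and `shadow` keys, where a missing shadow value means the shadow call failed; this schema and the function name `agreement_report` are assumptions made for the example.

```python
from collections import Counter


def agreement_report(records):
    """Summarise how often shadow and primary predictions match.

    Records whose shadow prediction is None are counted as shadow
    failures and excluded from the agreement rate.
    """
    total = 0
    compared = 0
    agreed = 0
    disagreements = Counter()
    for rec in records:
        total += 1
        if rec.get("shadow") is None:
            continue
        compared += 1
        if rec["shadow"] == rec["primary"]:
            agreed += 1
        else:
            # Track (primary, shadow) pairs to see where the models diverge.
            disagreements[(rec["primary"], rec["shadow"])] += 1
    return {
        "total": total,
        "shadow_failures": total - compared,
        "agreement_rate": agreed / compared if compared else None,
        "top_disagreements": disagreements.most_common(5),
    }
```

Exact equality suits classification outputs. For regression or score outputs, a tolerance-based comparison would be more appropriate. A high agreement rate does not by itself show that the shadow model is better; the disagreeing cases are the ones worth checking against ground truth.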
Promoting from Shadow to Production
Shadow deployment de-risks promotion: only proceed when the shadow model demonstrates superior or equal quality metrics with acceptable latency and error rates.
Promotion Checklist
- Shadow model accuracy ≥ primary model on held-out ground truth
- P95 latency within acceptable SLA (e.g., <200ms)
- Error rate <0.1% over a statistically significant request volume
- Data drift metrics stable between shadow input and training distribution
- Rollback plan documented and tested before full cutover
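The checklist above could be encoded as an automated gate along the following lines. This is a sketch under stated assumptions: the `ShadowMetrics` fields and the `promotion_blockers` function are hypothetical, the 200 ms and 0.1% thresholds mirror the examples in the checklist, and `min_requests=10_000` is an arbitrary stand-in for a proper statistical significance check.

```python
from dataclasses import dataclass


@dataclass
class ShadowMetrics:
    shadow_accuracy: float
    primary_accuracy: float
    p95_latency_ms: float
    error_rate: float        # fraction of shadow requests that failed
    request_count: int
    drift_stable: bool       # result of a separate drift check
    rollback_tested: bool    # confirmed manually by the team


def promotion_blockers(m, max_p95_ms=200.0, max_error_rate=0.001,
                       min_requests=10_000):
    """Return the list of unmet checklist items; an empty list means go."""
    blockers = []
    if m.shadow_accuracy < m.primary_accuracy:
        blockers.append("accuracy below primary")
    if m.p95_latency_ms >= max_p95_ms:
        blockers.append("p95 latency outside SLA")
    if m.request_count < min_requests:
        blockers.append("insufficient request volume")
    if m.error_rate >= max_error_rate:
        blockers.append("error rate too high")
    if not m.drift_stable:
        blockers.append("data drift detected")
    if not m.rollback_tested:
        blockers.append("rollback plan not tested")
    return blockers
```

Returning the list of blockers rather than a single boolean makes it clear which criteria still need work before cutover.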