Monitoring Model Latency and Throughput
Beyond prediction accuracy, production ML systems must meet latency and throughput service level objectives (SLOs). Systematic instrumentation with tools like Prometheus and Grafana makes performance problems visible before they affect users.
Key Performance Metrics
The four golden signals for any service (latency, traffic, errors, saturation) apply directly to ML APIs. ML services add model-specific metrics on top, such as feature compute time and batch queue depth.
Metrics Definitions
- Latency (P50/P95/P99): Time from request receipt to response; percentiles matter more than averages because an average hides the slow tail that users actually notice
- Throughput (RPS): Requests per second your API can sustain
- Error Rate: Fraction of requests returning 4xx/5xx or invalid predictions
- Feature Compute Time: Time spent in preprocessing, tracked separately from model inference time so you can tell which stage is slow
- Saturation: CPU/GPU utilisation, memory pressure, queue depth
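To make the point about percentiles concrete, the sketch below computes the mean, P50, P95, and P99 for a made-up latency sample using the nearest-rank method. The `percentile` helper and the sample values are illustrative assumptions, not part of any monitoring library; in production these numbers would normally come from Prometheus histograms rather than raw lists.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests (20 ms) and 10 slow ones (800 ms).
latencies_ms = [20.0] * 90 + [800.0] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
# prints: mean=98.0 p50=20.0 p95=800.0 p99=800.0
```

The mean of 98 ms describes no real request: the typical request takes 20 ms, while one in ten takes 800 ms. The P50 and P95/P99 values show both facts directly.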
Instrumenting a FastAPI App with Prometheus
The prometheus-fastapi-instrumentator library instruments a FastAPI app with request counters and duration histograms and exposes them on a /metrics endpoint that any Prometheus server can scrape.
Adding Prometheus Metrics
Prometheus Scrape Config
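A minimal scrape configuration might look like the fragment below. The job name `ml-api`, the 15-second interval, and the `localhost:8000` target are assumptions for a locally running API; replace them with the values for your deployment.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "ml-api"          # hypothetical job name
    metrics_path: /metrics      # endpoint exposed by the instrumented app
    scrape_interval: 15s        # assumed interval; tune to your needs
    static_configs:
      - targets: ["localhost:8000"]  # assumed host:port of the FastAPI app
```

A shorter scrape interval gives finer-grained latency data at the cost of more stored samples.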
Setting SLOs and Alerting
Service Level Objectives (SLOs) define what good performance looks like. Alerts fire when observed metrics breach these thresholds, so the team can respond before users notice.
Example SLO Definitions
- Availability: 99.9% of requests return a successful response per 30-day window
- Latency: P99 prediction latency <200ms, P50 <50ms
- Error budget: Maximum 43.2 minutes of downtime per 30-day window (0.1% of 43,200 minutes)
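The error budget follows directly from the availability target, as the sketch below shows. For a 30-day window, 0.1% works out to 43.2 minutes; the often-quoted figure of about 43.8 minutes corresponds to an average-length month of roughly 30.44 days. The function names are illustrative assumptions, and the request counts in the usage lines are made up.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (can go negative)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("no failures are allowed for this SLO and request count")
    return 1.0 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes")  # prints: 43.2 minutes

# Hypothetical month: 1,000,000 requests, 250 of them failed.
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")  # prints: 75% of budget left
```

Tracking the remaining budget as a percentage makes it easy to decide whether a risky deployment can go ahead or should wait for the next window.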
Set Grafana alerts to notify via Slack or PagerDuty when P99 latency exceeds 200ms for more than 5 consecutive minutes or error rate exceeds 1%.
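The alert condition above can be sketched in plain code to make its semantics explicit. This is an illustration of the logic only, not how Grafana evaluates rules internally; the `should_alert` function, its parameters, and the assumption of one P99 sample per minute are all hypothetical.

```python
def should_alert(
    p99_ms_per_minute: list[float],
    error_rate: float,
    latency_slo_ms: float = 200.0,
    sustain_minutes: int = 5,
    max_error_rate: float = 0.01,
) -> bool:
    """Return True if either alert condition from the SLO is met.

    p99_ms_per_minute holds one P99 latency sample per minute, oldest first.
    """
    # Condition 1: the error rate breaches its threshold.
    if error_rate > max_error_rate:
        return True

    # Condition 2: P99 stays above the SLO for more than
    # sustain_minutes consecutive one-minute samples.
    streak = 0
    for p99_ms in p99_ms_per_minute:
        streak = streak + 1 if p99_ms > latency_slo_ms else 0
        if streak > sustain_minutes:
            return True
    return False


print(should_alert([250.0] * 6, error_rate=0.0))   # prints: True
print(should_alert([250.0] * 5, error_rate=0.0))   # prints: False
print(should_alert([120.0] * 6, error_rate=0.02))  # prints: True
```

Requiring a sustained breach rather than a single bad sample keeps short latency spikes from paging anyone, at the cost of a few minutes of detection delay.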