Monitoring Model Latency and Throughput

Beyond prediction accuracy, production ML systems must meet latency and throughput SLOs. Systematic instrumentation with tools like Prometheus and Grafana makes performance problems visible before they affect users.

Key Performance Metrics

The four golden signals for any service (latency, traffic, errors, saturation) apply directly to ML APIs — with the addition of model-specific metrics like feature compute time and batch queue depth.

Metrics Definitions

Latency (P50/P95/P99): Time from request receipt to response — percentile metrics matter more than averages
Throughput (RPS): Requests per second your API can sustain
Error Rate: Fraction of requests returning 4xx/5xx or invalid predictions
Feature Compute Time: Time spent in preprocessing, tracked separately from model inference time
Saturation: CPU/GPU utilisation, memory pressure, queue depth
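
To see why percentiles matter more than averages, here is a minimal sketch using only the standard library. The sample data is invented for illustration: 97 fast requests and 3 very slow ones. The helper name `latency_percentiles` is an assumption, not part of any monitoring library.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return P50/P95/P99 of a list of latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Invented data: most requests take 20 ms, a few take 2 seconds.
samples = [20.0] * 97 + [2000.0] * 3

mean_ms = statistics.fmean(samples)
stats = latency_percentiles(samples)

print(f"mean={mean_ms:.1f} ms  p50={stats['p50']} ms  p99={stats['p99']} ms")
```

The mean (about 79 ms) describes no real request: typical users see 20 ms, while the slowest few wait 2 seconds. Only the P99 exposes that tail.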

Instrumenting a FastAPI App with Prometheus

The prometheus-fastapi-instrumentator library auto-instruments a FastAPI app with request count, duration histograms, and a /metrics endpoint — compatible with any Prometheus scraping setup.

Adding Prometheus Metrics

<pre><code class="language-python">import time

from fastapi import FastAPI
from prometheus_client import Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Auto-instrument all HTTP endpoints and expose /metrics
Instrumentator().instrument(app).expose(app)

# Custom histogram for model inference time
INFERENCE_TIME = Histogram(
    "model_inference_seconds",
    "Time spent running model prediction",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)


# ml_model is assumed to be a dict populated at startup (loading not shown here)
@app.post("/predict")
def predict(request: dict):
    start = time.perf_counter()
    result = ml_model["clf"].predict([request["features"]])
    elapsed = time.perf_counter() - start
    INFERENCE_TIME.observe(elapsed)
    return {"prediction": int(result[0])}</code></pre>
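
Feature compute time can be separated from inference time with a small timing helper. The sketch below uses only the standard library so it runs on its own. The names `timed`, `build_features` and `run_model` are hypothetical, and plain lists stand in for Prometheus histograms. In a real app, the `observe` argument could be the `observe` method of a second `Histogram`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(observe):
    """Time the enclosed block and pass the elapsed seconds to `observe`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observe(time.perf_counter() - start)


# Stand-ins for Histogram.observe so the sketch has no dependencies.
feature_times = []
inference_times = []


def build_features(raw):
    # Hypothetical preprocessing step: scale each raw value.
    return [x / 10.0 for x in raw]


def run_model(features):
    # Hypothetical model: a fixed threshold on the feature sum.
    return int(sum(features) > 1.0)


def predict(raw):
    with timed(feature_times.append):
        features = build_features(raw)
    with timed(inference_times.append):
        label = run_model(features)
    return label


prediction = predict([3, 4, 5])
print(prediction, feature_times, inference_times)
```

Recording the two stages separately shows whether a latency regression comes from preprocessing or from the model itself.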

Prometheus Scrape Config

<pre><code class="language-python"># prometheus.yml
# scrape_configs:
#   - job_name: ml-api
#     scrape_interval: 15s
#     static_configs:
#       - targets: ["localhost:8000"]

# Key Prometheus queries (PromQL), aggregated across all endpoints:
# P99 latency:  histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Request rate: sum(rate(http_requests_total[5m]))
# Error rate:   sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))</code></pre>
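
It helps to know how `histogram_quantile` arrives at a number: it finds the bucket containing the requested rank and interpolates linearly inside it. The sketch below is a simplified Python rendering of that idea, not Prometheus's actual implementation. The bucket counts are invented, and the lower bound of the first bucket is assumed to be 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented cumulative counts: 900 requests <= 50 ms, 980 <= 100 ms, 1000 <= 500 ms.
buckets = [(0.05, 900), (0.1, 980), (0.5, 1000), (math.inf, 1000)]

p99 = histogram_quantile(0.99, buckets)  # about 0.3 s
print(f"estimated P99: {p99:.3f} s")
```

The estimate can land anywhere between 0.1 s and 0.5 s depending on the counts, because the histogram has no boundary in between. If an SLO threshold sits at 200 ms, adding a bucket boundary at or near 0.2 makes the estimate around that threshold much more reliable.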

Setting SLOs and Alerting

Service Level Objectives (SLOs) define what good performance looks like — alerts fire when observed metrics breach these thresholds, enabling proactive response before users notice.

Example SLO Definitions

Availability: 99.9% of requests return a successful response per 30-day window
Latency: P99 prediction latency <200ms, P50 <50ms
Error budget: Maximum 43.2 minutes of downtime per 30-day window (0.1% of 30 days)
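
The error-budget arithmetic is easy to check in code. This is a minimal sketch: the function names are illustrative, and the request counts in the usage lines are invented.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if slo_percent >= 100:
        raise ValueError("a 100% SLO leaves no error budget")
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    return 1.0 - failed_requests / allowed_failures


budget = error_budget_minutes(99.9)  # 30 days = 43,200 minutes, 0.1% of that is about 43.2
remaining = budget_remaining(99.9, total_requests=1_000_000, failed_requests=250)

print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

With one million requests in the window, a 99.9% SLO allows roughly 1,000 failures, so 250 failures leave about three quarters of the budget unspent.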

Set Grafana alerts to notify via Slack or PagerDuty when P99 latency exceeds 200ms for more than 5 consecutive minutes or error rate exceeds 1%.
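
The "for more than 5 consecutive minutes" condition keeps short spikes from paging anyone. Grafana and Prometheus evaluate this internally. The sketch below only illustrates the logic in plain Python, with a hypothetical `SustainedBreach` class and invented P99 samples taken once per minute.

```python
class SustainedBreach:
    """Fire only after `threshold` has been exceeded for `hold_seconds` without a break."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp of the first sample in the current breach

    def update(self, timestamp, value):
        """Feed one sample; return True if the alert should be firing."""
        if value <= self.threshold:
            self.breach_started = None  # any healthy sample resets the clock
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.hold_seconds


# Invented (seconds, P99 latency in seconds) samples, one per minute.
samples = [
    (0, 0.25), (60, 0.30), (120, 0.15),  # short spike, then recovery
    (180, 0.25), (240, 0.26), (300, 0.27),
    (360, 0.28), (420, 0.29), (480, 0.30),  # breach sustained from t=180
]

alert = SustainedBreach(threshold=0.2, hold_seconds=300)
fired_at = [t for t, p99 in samples if alert.update(t, p99)]
print(fired_at)
```

The first spike never fires because latency recovers at t=120. The alert fires only at t=480, once the breach that began at t=180 has lasted five minutes.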