LLM Throughput Scaling

1.2 million inference requests per second. That's not a theoretical ceiling — it's what we measured in production on a 64-node A100 cluster running Llama 3 70B with Inferex. Here's the architecture that makes it possible.

The Throughput Problem

Single-node inference throughput is well understood. The hard problem is scaling horizontally while maintaining low latency under variable load. Most teams hit one of three failure modes: a centralized load balancer that becomes the bottleneck, GPUs idling between statically sized batches, and autoscaling that reacts only after a traffic spike has already degraded latency.

Inferex addresses all three with a three-layer architecture: request routing, continuous batching, and predictive scaling.

Layer 1: Distributed Request Routing

Instead of a central load balancer, Inferex uses a gossip-based routing mesh. Each client holds a view of worker load, updated every 50ms via UDP broadcast. Routing decisions are made client-side — zero round-trips to a central coordinator. Overhead: 0.1ms per routing decision.
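To make the client-side decision concrete, here is a minimal sketch of what the routing logic could look like, assuming each client keeps a local load table that a background listener refreshes from the gossip broadcasts. The ClientRouter class, the power-of-two-choices selection policy, and the staleness cutoff are illustrative assumptions for this sketch, not Inferex's actual API.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """One entry in a client's local view of the mesh (illustrative structure)."""
    worker_id: str
    queue_depth: int   # outstanding requests last reported by the worker
    updated_at: float  # wall-clock time of the last gossip update


class ClientRouter:
    """Client-side router over a locally cached load table.

    The table is assumed to be refreshed roughly every 50 ms by a background
    UDP listener; pick_worker() never contacts a central coordinator.
    """

    def __init__(self, staleness_limit_s: float = 0.5):
        self.loads: dict[str, WorkerLoad] = {}
        self.staleness_limit_s = staleness_limit_s

    def on_gossip(self, worker_id: str, queue_depth: int) -> None:
        # Called by the UDP listener whenever a load broadcast arrives.
        self.loads[worker_id] = WorkerLoad(worker_id, queue_depth, time.time())

    def pick_worker(self) -> str:
        # Ignore entries that have gone stale (worker silent, likely unhealthy).
        now = time.time()
        fresh = [w for w in self.loads.values()
                 if now - w.updated_at < self.staleness_limit_s]
        if not fresh:
            raise RuntimeError("no fresh worker load data in local view")
        if len(fresh) == 1:
            return fresh[0].worker_id
        # Power-of-two-choices: sample two candidates, route to the less loaded.
        a, b = random.sample(fresh, 2)
        return a.worker_id if a.queue_depth <= b.queue_depth else b.worker_id
```

Sampling two candidates rather than always picking the global minimum is one way to avoid herd behavior when many clients share a similar, slightly stale view of worker load.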

Layer 2: Continuous Batching

Traditional static batching — collect N requests, then process — wastes GPU cycles. Inferex uses continuous batching: requests are inserted into the batch mid-execution, as tokens complete. This keeps GPU utilization above 88% even under bursty traffic patterns.
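The scheduling loop behind continuous batching can be sketched roughly as below. Request, append_token, is_finished, and model.decode_step are placeholder names for illustration, not actual Inferex interfaces; the key point is that admission and eviction happen on every decode step rather than once per batch.

```python
from collections import deque


class ContinuousBatcher:
    """Sketch of a continuous-batching scheduler loop (placeholder interfaces).

    Finished sequences are evicted and waiting requests are admitted on every
    decode step, so batch slots never sit idle waiting for a fixed batch to fill.
    """

    def __init__(self, model, max_batch_size: int = 64):
        self.model = model                 # must expose decode_step(requests) -> tokens
        self.max_batch_size = max_batch_size
        self.waiting = deque()             # requests not yet admitted
        self.running = []                  # requests currently in the GPU batch

    def submit(self, request) -> None:
        self.waiting.append(request)

    def step(self) -> None:
        # Admit waiting requests into any free slots mid-execution.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return

        # One decode step emits one token per running sequence.
        tokens = self.model.decode_step(self.running)

        # Evict sequences that just finished, freeing their slots for the
        # next step() call instead of waiting for the whole batch to drain.
        still_running = []
        for request, token in zip(self.running, tokens):
            request.append_token(token)
            if not request.is_finished():
                still_running.append(request)
        self.running = still_running
```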

Layer 3: Predictive Auto-Scaling

We train a lightweight ARIMA model on each customer's request time series. It forecasts demand 90 seconds ahead, triggering scale-out before the traffic wave arrives. Typical scale-out time (warm workers): 8 seconds. Cold start: 45 seconds (we maintain a warm pool).
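A hedged sketch of the forecasting side, using statsmodels' ARIMA as a stand-in for the lightweight model described above. The sampling interval, ARIMA order, per-worker capacity, and headroom factor are assumptions chosen for this sketch, not figures from this post.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA


def forecast_required_workers(req_per_sec_history: np.ndarray,
                              per_worker_capacity: float = 20_000.0,
                              horizon_steps: int = 9,
                              headroom: float = 1.2) -> int:
    """Forecast demand ~90 s ahead and return a target worker count.

    Assumes the history is sampled every 10 s, so 9 steps cover ~90 s.
    The ARIMA order, per-worker capacity, and 20% headroom are illustrative
    values chosen for this sketch.
    """
    fitted = ARIMA(req_per_sec_history, order=(2, 1, 1)).fit()
    forecast = fitted.forecast(steps=horizon_steps)

    # Scale for the peak of the forecast window plus headroom, so warm
    # workers are already up when the traffic wave arrives.
    peak_demand = float(np.max(forecast))
    return max(1, int(np.ceil(peak_demand * headroom / per_worker_capacity)))


# Example: last 30 minutes of request rate, sampled every 10 s.
history = np.random.default_rng(0).normal(800_000, 50_000, size=180)
print(forecast_required_workers(history))
```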

Results at 1.2M req/s

What's Next

We're currently testing a prefill/decode disaggregation architecture that should push P99 under 6ms at 1M+ req/s by separating the compute-bound prefill phase from the memory-bound decode phase. Results in Q3 2026.
