vLLM vs TensorRT vs Inferex Benchmark

The inference serving landscape has fragmented dramatically in 2025. vLLM, TensorRT-LLM, OpenLLM, Triton Inference Server, and now Inferex — engineers are drowning in options with no independent benchmark data. We're publishing ours. Yes, we have an obvious interest in this benchmark. Read it critically.

Methodology

All tests run on a single NVIDIA A100 80GB SXM4, Ubuntu 22.04, CUDA 12.3, driver 545.23. Models: Llama 3 8B and Llama 3 70B. Workloads: synthetic load with 512-token inputs and 256-token outputs, with requests arriving via a Poisson process at the target rate. Each benchmark run: 10-minute warmup, 30-minute measurement window. P99 latency is reported at 90% of the maximum request rate each system could sustain without unbounded queue growth.
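To make the workload definition concrete, here is a minimal sketch of the two methodology ingredients above: Poisson request arrivals (exponentially distributed inter-arrival gaps) and the P99 latency statistic. The helper names are our own for illustration; this is not the actual benchmark harness.

```python
import random

def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 0):
    """Arrival timestamps for a Poisson process: inter-arrival gaps are
    exponentially distributed with mean 1/rate."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # exponential gap to next request
        if t > duration_s:
            return arrivals
        arrivals.append(t)

def p99(latencies_ms):
    """P99 latency via linear interpolation between closest ranks."""
    xs = sorted(latencies_ms)
    k = 0.99 * (len(xs) - 1)
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

# A 30-minute measurement window at 100 req/s yields roughly 180,000
# requests, enough samples for a stable tail-latency estimate.
arrivals = poisson_arrival_times(rate_rps=100, duration_s=1800)
print(len(arrivals))
```

At 100 req/s a P99 measured over ~180k samples rests on ~1,800 tail observations, which is why the measurement window is 30 minutes rather than 30 seconds.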

Llama 3 8B — Single A100 80GB

Llama 3 70B — Single A100 80GB

(Llama 3 70B requires tensor parallelism across multiple GPUs at FP16. All single-card benchmarks here use INT4/INT8 quantization to fit in 80 GB.)
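The arithmetic behind that constraint is worth spelling out. The sketch below estimates weight memory only, ignoring KV cache, activations, and framework overhead, so real headroom is tighter than these numbers suggest.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameter count x bytes per weight.
    Ignores KV cache, activations, and framework overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for model, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{model} @ {name}: ~{weight_memory_gb(params, bits):.0f} GB")
```

Llama 3 70B at FP16 needs ~140 GB of weights alone, well past an A100's 80 GB; INT8 (~70 GB) barely fits with little room for KV cache, while INT4 (~35 GB) leaves real headroom, which is why the single-card runs are quantized.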

Setup Complexity

Where vLLM Wins

vLLM is the easiest entry point and has the best community support. If you're prototyping or running a low-traffic service under 100 req/s, vLLM is the right choice. The gap to Inferex at low load is small; the simplicity advantage is real.
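The simplicity claim is easy to demonstrate. The commands below follow vLLM's documented quickstart for recent releases (the model name and default port 8000 are assumptions; check your version's docs):

```shell
# Install and serve in two commands.
pip install vllm

# Starts an OpenAI-compatible HTTP server (default port 8000).
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# Query it with any OpenAI-compatible client:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
       "prompt": "Hello", "max_tokens": 64}'
```

No model compilation step, no engine artifacts to manage: the model is downloaded and served directly, which is most of the simplicity advantage at low load.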

Where TensorRT-LLM Wins

If your fleet is 100% NVIDIA GPUs, you need maximum hardware utilization, and you have the engineering bandwidth to maintain the toolchain, TensorRT-LLM is a strong choice. It outperforms vLLM significantly and approaches Inferex performance on some workloads.
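The "engineering bandwidth" caveat refers to TensorRT-LLM's ahead-of-time build flow: convert the checkpoint, then compile a hardware-specific engine. This sketch follows the pattern in TensorRT-LLM's examples, but exact script paths and flags vary by release, so treat it as illustrative:

```shell
# Step 1: convert the Hugging Face checkpoint to TensorRT-LLM format.
python examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B \
    --output_dir ./ckpt_tllm \
    --dtype float16

# Step 2: compile an inference engine for this specific GPU.
trtllm-build \
    --checkpoint_dir ./ckpt_tllm \
    --output_dir ./engine_llama3_8b \
    --gemm_plugin float16
```

The resulting engine is tied to the GPU architecture, CUDA version, and TensorRT-LLM release it was built with, so driver upgrades and library bumps mean rebuilds. That maintenance load is the price of the performance.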

Where Inferex Wins

Multi-hardware environments (GPU + CPU + edge). High-concurrency workloads where every millisecond counts. Teams without ML infrastructure expertise who want optimization without operational complexity. Observability and auto-scaling included rather than bolted on.
