There is no universal answer to "should I use GPU or CPU for inference?" The right answer depends on your model size, latency requirements, batch patterns, and cost constraints. Here's the framework we use at Inferex when onboarding new customers — and the benchmarks behind it.
The Three Hardware Categories
GPU: NVIDIA A100 / H100
The default choice for large-model inference (>7B parameters). High memory bandwidth (2 TB/s on H100 PCIe, 3.35 TB/s on H100 SXM) enables fast KV-cache reads. Best for: batch sizes >4, models above 7B, latency targets under 20ms.
Cost: $2–8/hour cloud, $30k–80k upfront. When to avoid: models under 1B parameters, ultra-low-power edge deployments, batch size consistently 1.
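Why bandwidth dominates at small batch sizes: generating each token requires streaming every weight through the memory system once, so bandwidth sets a hard latency floor regardless of compute. A back-of-envelope estimate (the function name is ours, for illustration):

```python
def decode_ms_per_token(param_count, bytes_per_param, bandwidth_gbs):
    """Lower-bound decode latency at batch size 1: every weight
    must be read from memory once per generated token."""
    model_bytes = param_count * bytes_per_param
    return model_bytes / (bandwidth_gbs * 1e9) * 1e3

# An 8B-parameter model in FP8 (1 byte/param) on ~2 TB/s HBM:
print(decode_ms_per_token(8e9, 1, 2000))  # → 4.0 ms per token
```

This is a floor, not a prediction: real latency adds kernel launch overhead and attention over the KV cache, but the estimate explains why bandwidth, not FLOPS, is the headline GPU spec for inference.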
CPU: Intel Xeon / AMD EPYC
Underrated for inference. With INT8 quantization and AVX-512 vectorization, modern Xeon Platinum cores can serve small/medium models at competitive latency. Cost is 10–50x lower than GPU. Best for: models under 3B parameters, batch size 1–2, cost-sensitive deployments.
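INT8 quantization is what makes CPU serving competitive: weights are stored as 8-bit integers plus a per-tensor scale, quartering memory traffic versus FP32. A minimal symmetric-quantization sketch in pure Python (the real AVX-512 path is vectorized, but the arithmetic is the same):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 1.27, -1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Round-trip error is bounded by half a quantization step (scale / 2).
print(q, round(max_err, 4))
```

The worst-case error per weight is half a quantization step, which is why well-conditioned small models lose little accuracy at INT8 while gaining 2–4x throughput on CPU.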
Inferex CPU optimization adds AMX tile acceleration on 4th-gen Xeon (Sapphire Rapids), typically doubling throughput vs. vanilla PyTorch.
Edge TPU / NPU
Google Edge TPU, Apple Neural Engine, Qualcomm Hexagon — these are purpose-built for inference at the edge. Fixed-function silicon achieves 5–50 TOPS at 1–5W. Best for: mobile/IoT, latency under 5ms, privacy-sensitive on-device inference.
Limitations: model size cap (typically 100MB–2GB), quantization required, limited operator support.
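The model-size cap is the first gate to check when targeting an edge accelerator. A quick fit check (the cap value below is a hypothetical example, not a specific device's limit):

```python
def fits_on_npu(param_count, bits_per_param, cap_mb):
    """True if the quantized model fits under the accelerator's size cap."""
    size_mb = param_count * bits_per_param / 8 / 1e6
    return size_mb <= cap_mb

# A 300M-parameter model at INT8 against a hypothetical 512 MB cap:
print(fits_on_npu(300e6, 8, 512))   # True  (300 MB)
# Llama 3 8B at INT8 against a 2 GB cap:
print(fits_on_npu(8e9, 8, 2048))    # False (8,000 MB)
```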
Benchmark Results
Llama 3 8B, FP8 quantized, batch size 1, 256-token input:
- NVIDIA A100 40GB: 3.8ms P99, $3.20/hour, 340 req/s single node
- Intel Xeon Platinum 8490H (2-socket): 18ms P99, $0.28/hour, 55 req/s single node
- Qualcomm Cloud AI 100 Ultra: 5.2ms P99, $1.80/hour, 210 req/s single node
For Llama 3 8B at batch size 1, the Xeon is 11x cheaper per hour and, once throughput is factored in (55 vs. 340 req/s), roughly 1.8x cheaper per request, at 4.7x higher latency. Whether that trade-off is acceptable depends entirely on your SLA.
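Working through the per-request economics from the table above (the helper name is ours):

```python
# Per-request cost = hourly price / requests served per hour.
def cost_per_million_requests(price_per_hour, req_per_s):
    return price_per_hour / (req_per_s * 3600) * 1e6

a100 = cost_per_million_requests(3.20, 340)   # ~$2.61 per 1M requests
xeon = cost_per_million_requests(0.28, 55)    # ~$1.41 per 1M requests
print(round(3.20 / 0.28, 1))    # hourly price ratio: 11.4x
print(round(a100 / xeon, 1))    # per-request cost ratio: 1.8x
```

The distinction matters: hourly price comparisons flatter the CPU because they ignore the GPU's higher throughput. Per-request cost is the number to put next to your SLA.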
The Decision Framework
- Model > 13B AND latency < 20ms: GPU only
- Model 3B–13B, latency 20–100ms: GPU or specialized accelerator
- Model < 3B, latency > 50ms: CPU viable, significant cost savings
- On-device / edge: Edge NPU if model fits, else quantized CPU
- Batch size consistently > 8: GPU is almost always optimal
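The bullets above can be collapsed into a small routing function. This is a sketch of the framework as stated, not Inferex's API; the function name and the fallback branch are ours:

```python
def recommend_hardware(model_b_params, latency_ms_target, batch_size,
                       on_device=False):
    """Route a workload to a hardware class per the decision framework."""
    if on_device:
        return "edge NPU if model fits, else quantized CPU"
    if batch_size > 8:
        return "GPU"  # almost always optimal at high batch sizes
    if model_b_params > 13 and latency_ms_target < 20:
        return "GPU"
    if 3 <= model_b_params <= 13 and 20 <= latency_ms_target <= 100:
        return "GPU or specialized accelerator"
    if model_b_params < 3 and latency_ms_target > 50:
        return "CPU"  # viable, with significant cost savings
    # Combinations the framework doesn't cover explicitly:
    return "benchmark both"

print(recommend_hardware(70, 10, 1))    # → GPU
print(recommend_hardware(8, 50, 2))     # → GPU or specialized accelerator
print(recommend_hardware(1, 100, 1))    # → CPU
```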
Inferex's hardware abstraction layer lets you deploy the same optimized model to any of these targets with a single configuration change. The kernel optimizations are hardware-specific under the hood — you don't need to think about it.