There is no universal answer to "should I use GPU or CPU for inference?" The right answer depends on your model size, latency requirements, batch patterns, and cost constraints. Here's the framework we use at Inferex when onboarding new customers — and the benchmarks behind it.
The Three Hardware Categories
GPU: NVIDIA A100 / H100
The default choice for large-model inference (>7B parameters). High memory bandwidth (2 TB/s on H100 PCIe, 3.35 TB/s on H100 SXM) enables fast KV-cache reads. Best for: batch sizes >4, models above 7B, latency targets under 20ms.
Cost: $2–8/hour cloud, $30k–80k upfront. When to avoid: models under 1B parameters, ultra-low-power edge deployments, batch size consistently 1.
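Why bandwidth dominates at small batch sizes: generating each token requires streaming every weight through the memory system once, so bandwidth sets a hard latency floor regardless of compute. A back-of-envelope estimate (the function name is ours, for illustration):

```python
def decode_ms_per_token(param_count, bytes_per_param, bandwidth_gbs):
    """Lower-bound decode latency at batch size 1: every weight
    must be read from memory once per generated token."""
    model_bytes = param_count * bytes_per_param
    return model_bytes / (bandwidth_gbs * 1e9) * 1e3

# An 8B-parameter model in FP8 (1 byte/param) on ~2 TB/s HBM:
print(decode_ms_per_token(8e9, 1, 2000))  # → 4.0 ms per token
```

This is a floor, not a prediction: real latency adds kernel launch overhead and attention over the KV cache, but the estimate explains why bandwidth, not FLOPS, is the headline GPU spec for inference.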
CPU: Intel Xeon / AMD EPYC
Underrated for inference. With INT8 quantization and AVX-512 vectorization, modern Xeon Platinum cores can serve small/medium models at competitive latency. Cost is 10–50x lower than GPU. Best for: models under 3B parameters, batch size 1–2, cost-sensitive deployments.
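INT8 quantization is what makes CPU serving competitive: weights are stored as 8-bit integers plus a per-tensor scale, quartering memory traffic versus FP32. A minimal symmetric-quantization sketch in pure Python (the real AVX-512 path is vectorized, but the arithmetic is the same):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 1.27, -1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Round-trip error is bounded by half a quantization step (scale / 2).
print(q, round(max_err, 4))
```

The worst-case error per weight is half a quantization step, which is why well-conditioned small models lose little accuracy at INT8 while gaining 2–4x throughput on CPU.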
Inferex CPU optimization adds AMX tile acceleration on 4th-gen Xeon (Sapphire Rapids), typically doubling throughput vs. vanilla PyTorch.
Edge TPU / NPU
Google Edge TPU, Apple Neural Engine, Qualcomm Hexagon — these are purpose-built for inference at the edge. Fixed-function silicon achieves 5–50 TOPS at 1–5W. Best for: mobile/IoT, latency under 5ms, privacy-sensitive on-device inference.
Limitations: model size cap (typically 100MB–2GB), quantization required, limited operator support.
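The model-size cap is the first gate to check when targeting an edge accelerator. A quick fit check (the cap value below is a hypothetical example, not a specific device's limit):

```python
def fits_on_npu(param_count, bits_per_param, cap_mb):
    """True if the quantized model fits under the accelerator's size cap."""
    size_mb = param_count * bits_per_param / 8 / 1e6
    return size_mb <= cap_mb

# A 300M-parameter model at INT8 against a hypothetical 512 MB cap:
print(fits_on_npu(300e6, 8, 512))   # True  (300 MB)
# Llama 3 8B at INT8 against a 2 GB cap:
print(fits_on_npu(8e9, 8, 2048))    # False (8,000 MB)
```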
Benchmark Results
Llama 3 8B, FP8 quantized, batch size 1, 256-token input:
- NVIDIA A100 40GB: 3.8ms P99, $3.20/hour, 340 req/s single node
- Intel Xeon Platinum 8490H (2-socket): 18ms P99, $0.28/hour, 55 req/s single node
- Qualcomm Cloud AI 100 Ultra: 5.2ms P99, $1.80/hour, 210 req/s single node
For Llama 3 8B at batch size 1, the Xeon is 11x cheaper per hour and, once throughput is factored in (55 vs. 340 req/s), roughly 1.8x cheaper per request, at 4.7x higher latency. Whether that trade-off is acceptable depends entirely on your SLA.
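Working through the per-request economics from the table above (the helper name is ours):

```python
# Per-request cost = hourly price / requests served per hour.
def cost_per_million_requests(price_per_hour, req_per_s):
    return price_per_hour / (req_per_s * 3600) * 1e6

a100 = cost_per_million_requests(3.20, 340)   # ~$2.61 per 1M requests
xeon = cost_per_million_requests(0.28, 55)    # ~$1.41 per 1M requests
print(round(3.20 / 0.28, 1))    # hourly price ratio: 11.4x
print(round(a100 / xeon, 1))    # per-request cost ratio: 1.8x
```

The distinction matters: hourly price comparisons flatter the CPU because they ignore the GPU's higher throughput. Per-request cost is the number to put next to your SLA.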
The Decision Framework
- Model > 13B AND latency < 20ms: GPU only
- Model 3B–13B, latency 20–100ms: GPU or specialized accelerator
- Model < 3B, latency > 50ms: CPU viable, significant cost savings
- On-device / edge: Edge NPU if model fits, else quantized CPU
- Batch size consistently > 8: GPU is almost always optimal
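The bullets above can be collapsed into a small routing function. This is a sketch of the framework as stated, not Inferex's API; the function name and the fallback branch are ours:

```python
def recommend_hardware(model_b_params, latency_ms_target, batch_size,
                       on_device=False):
    """Route a workload to a hardware class per the decision framework."""
    if on_device:
        return "edge NPU if model fits, else quantized CPU"
    if batch_size > 8:
        return "GPU"  # almost always optimal at high batch sizes
    if model_b_params > 13 and latency_ms_target < 20:
        return "GPU"
    if 3 <= model_b_params <= 13 and 20 <= latency_ms_target <= 100:
        return "GPU or specialized accelerator"
    if model_b_params < 3 and latency_ms_target > 50:
        return "CPU"  # viable, with significant cost savings
    # Combinations the framework doesn't cover explicitly:
    return "benchmark both"

print(recommend_hardware(70, 10, 1))    # → GPU
print(recommend_hardware(8, 50, 2))     # → GPU or specialized accelerator
print(recommend_hardware(1, 100, 1))    # → CPU
```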
Inferex's hardware abstraction layer lets you deploy the same optimized model to any of these targets with a single configuration change. The kernel optimizations are hardware-specific under the hood — you don't need to think about it.