Inference Optimization
Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.
What it is
Inference optimization encompasses techniques to make LLMs faster, cheaper, and more efficient in production. While training happens once, inference happens millions of times — small improvements have massive impact.
Main techniques
Quantization
Reducing the numerical precision of model weights:
- FP16/BF16: 16-bit floats, halving FP32 memory; the standard baseline for inference
- INT8: 8-bit integers (~2x memory reduction over FP16)
- INT4 (GPTQ, AWQ): 4-bit weights (~4x reduction over FP16, with small quality loss when calibrated well)
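The memory arithmetic behind these ratios is simple: weight memory scales linearly with bits per parameter. A minimal sketch, using a hypothetical 7B-parameter model (the function name and sizes are illustrative, not from any specific library):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameters x bits, converted to gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 7B-parameter model at each precision:
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_memory_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```

Note this covers weights only; activations and the KV cache add to the real footprint.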
KV Cache
Caching the attention keys and values of previous tokens so they are not recomputed at every decoding step. Essential for efficient autoregressive generation; the cache grows linearly with sequence length and batch size, so it can dominate GPU memory at long contexts.
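The cache size is easy to estimate: two tensors (K and V) per layer, each shaped by heads, head dimension, and sequence length. A sketch, using Llama-2-7B-like shapes (32 layers, 32 heads, head dimension 128) purely as an illustrative assumption:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer,
    each of shape [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class shapes at a 4096-token context, FP16 elements:
gb = kv_cache_bytes(32, 32, 128, seq_len=4096) / 1e9
print(f"{gb:.1f} GB per sequence")  # ~2.1 GB per sequence
```

At batch size 16 that single context already consumes tens of gigabytes, which is why techniques like PagedAttention and grouped-query attention target the cache specifically.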
Batching
- Static batching: grouping multiple requests into a fixed batch before execution
- Continuous batching: dynamically adding and removing requests as they arrive and complete, keeping the GPU saturated
Speculative decoding
A small draft model proposes several tokens ahead; the large target model verifies them in a single forward pass, accepting the longest agreeing prefix. (Often grouped with batching optimizations, but it is a decoding technique in its own right.)
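The accept/reject logic of speculative decoding can be shown with a toy greedy version. The two "models" below are deliberately trivial stand-ins (plain functions over a token list), not a real draft/target pair:

```python
def speculative_decode(target, draft, prefix, k=4, max_new=8):
    """Toy greedy speculative decoding: `draft` proposes k tokens;
    `target` accepts them while it agrees, and on the first mismatch
    emits its own token and discards the rest of the draft."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # Draft phase: propose k tokens autoregressively (cheap model)
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Verify phase: the target checks the proposals in order
        for t in proposal:
            if len(out) - len(prefix) >= max_new:
                break
            correct = target(out)
            out.append(correct)
            if correct != t:
                break  # rejection: remaining draft tokens are thrown away
    return out[len(prefix):]

# Toy models: the target counts up; the draft is wrong every third token
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 3 else ctx[-1] + 2
print(speculative_decode(target, draft, [0], k=4, max_new=6))
# → [1, 2, 3, 4, 5, 6]
```

The output always matches what the target alone would produce; the speedup comes from verifying k draft tokens in one target forward pass instead of k passes.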
Distributed serving
- Tensor parallelism: splitting individual weight matrices across GPUs, so each GPU computes a slice of every layer
- Pipeline parallelism: placing different layers on different GPUs and streaming micro-batches through them
- Expert parallelism: distributing the experts of a Mixture-of-Experts (MoE) model across GPUs
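The core idea of tensor parallelism is that a matrix multiply splits cleanly along the weight matrix's columns. A pure-Python sketch with two simulated "devices" (real systems use NCCL all-gather/all-reduce collectives; everything here is an illustrative stand-in):

```python
def matmul(A, B):
    """Naive matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def column_parallel_matmul(A, B, n_shards=2):
    """Toy tensor parallelism: split B's columns across `n_shards`
    'devices', compute partial outputs independently, then
    concatenate the results (the all-gather step)."""
    cols = list(zip(*B))
    shard_size = len(cols) // n_shards
    partials = []
    for s in range(n_shards):  # each iteration stands in for one GPU
        shard_cols = cols[s * shard_size:(s + 1) * shard_size]
        B_shard = [list(r) for r in zip(*shard_cols)]  # back to row-major
        partials.append(matmul(A, B_shard))
    # Concatenate per-device outputs along the column axis
    return [sum((p[i] for p in partials), []) for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6, 7, 8], [9, 10, 11, 12]]
assert column_parallel_matmul(A, B) == matmul(A, B)
```

Splitting along rows instead requires an all-reduce (summing partial results) rather than a concatenation; real implementations alternate the two to minimize communication.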
Serving frameworks
| Framework | Characteristics |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA optimization, high performance |
| Ollama | Local, easy to use |
| llama.cpp | CPU inference, aggressive quantization |
Key metrics
- TTFT (Time to First Token): latency until the first generated token
- TPS (Tokens Per Second): generation speed
- Throughput: requests per second for the system
- Cost per token: inference price per generated token
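TTFT and TPS fall out directly from per-token timestamps. A minimal sketch with a hypothetical trace (the function and the timings are illustrative, not from any serving framework):

```python
def generation_metrics(request_start, token_times):
    """TTFT = delay until the first token; TPS = tokens generated per
    second after the first token, from per-token timestamps (seconds)."""
    ttft = token_times[0] - request_start
    gen_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, tps

# Hypothetical trace: request at t=0, first token at 0.25 s, then 50 ms/token
times = [0.25 + 0.05 * i for i in range(5)]
ttft, tps = generation_metrics(0.0, times)
print(f"TTFT={ttft:.2f}s, TPS={tps:.0f}")  # TTFT=0.25s, TPS=20
```

The split matters because the two metrics respond to different optimizations: TTFT is dominated by prompt processing (prefill), while TPS reflects the per-token decode loop that the KV cache and batching accelerate.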
Serverless inference
Managed services such as AWS Bedrock and provider APIs eliminate infrastructure management entirely: you pay per token consumed, which suits variable or bursty workloads.
Why it matters
In production, inference happens millions of times. Every millisecond of latency and every processed token has a cost. Optimization techniques — quantization, KV cache, batching — are the difference between an economically viable AI system and one that breaks the budget.
References
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Kwon et al., 2023.
- A Survey on Efficient Inference for Large Language Models — Zhou et al., 2024.