Concepts

Inference Optimization

Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.

seed #inference #optimization #quantization #latency #serving #llm #performance

What it is

Inference optimization encompasses techniques to make LLMs faster, cheaper, and more efficient in production. While training happens once, inference happens millions of times — small improvements have massive impact.

Main techniques

Quantization

Reducing the numerical precision of model weights:

  • FP16/BF16: 32-bit floats halved to 16-bit (the de facto production baseline)
  • INT8: 8-bit integers (~2x memory reduction vs FP16)
  • INT4 (GPTQ, AWQ): 4-bit (~4x reduction vs FP16, with minimal quality loss)
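The core idea behind all of these schemes is mapping floats onto a small integer grid plus a scale factor. A minimal sketch of symmetric per-tensor INT8 quantization (the function names here are illustrative, not any library's API):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8: map floats onto the integer grid [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats; error is at most half a quantization step."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes, w.nbytes)       # 1 byte/weight vs 4: a 4x memory reduction vs FP32
print(np.abs(w - w_hat).max())  # small, bounded reconstruction error
```

Production schemes like GPTQ and AWQ are far more sophisticated (per-channel or per-group scales, calibration data, error compensation), but the scale-and-round core is the same.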

KV Cache

Storing key-value pairs from attention layers to avoid recalculating them for each generated token. Essential for efficient autoregressive generation.
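The mechanism can be shown with a toy single-head decode loop in NumPy (a minimal sketch, not a real attention implementation): each step projects only the new token's key and value and appends them to the cache, instead of recomputing the whole prefix.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
rng = np.random.default_rng(0)
for step in range(5):
    # stand-ins for the new token's projected key, value, and query
    k_new, v_new, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k_new[None]])  # append instead of recompute
    V_cache = np.vstack([V_cache, v_new[None]])
    out = attend(q, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached key per generated token
```

The trade-off is memory: the cache grows linearly with sequence length, which is exactly what systems like vLLM's PagedAttention are built to manage.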

Batching

  • Static batching: grouping multiple requests into a fixed-size batch
  • Continuous batching: dynamically adding and removing requests between decode steps
  • Speculative decoding: a small draft model proposes tokens that the large model verifies in one forward pass

Distributed serving

  • Tensor parallelism: splitting each layer's weight matrices across multiple GPUs
  • Pipeline parallelism: assigning different layers to different GPUs
  • Expert parallelism: distributing experts across GPUs in Mixture-of-Experts (MoE) models
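Tensor parallelism is easy to see on a single linear layer. A minimal NumPy sketch (simulating "devices" with array shards; on real GPUs the partial results are combined with a collective such as all-gather):

```python
import numpy as np

def tensor_parallel_matmul(x, W, n_devices=2):
    """Column-parallel linear layer: each device holds a shard of W's output
    columns and computes its slice of the output independently."""
    shards = np.array_split(W, n_devices, axis=1)  # one weight shard per device
    partials = [x @ shard for shard in shards]     # runs in parallel on real GPUs
    return np.concatenate(partials, axis=-1)       # the "all-gather" step

x = np.random.randn(4, 16)
W = np.random.randn(16, 32)
y = tensor_parallel_matmul(x, W)
print(np.allclose(y, x @ W))  # True: sharded result matches the single-device matmul
```

The payoff is that no single GPU ever holds the full weight matrix, which is what makes models larger than one GPU's memory servable at all.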

Serving frameworks

  • vLLM: PagedAttention, continuous batching
  • TensorRT-LLM: NVIDIA optimizations, high performance
  • Ollama: local, easy to use
  • llama.cpp: CPU inference, aggressive quantization

Key metrics

  • TTFT (Time to First Token): latency until the first generated token
  • TPS (Tokens Per Second): generation speed
  • Throughput: requests per second for the system
  • Cost per token: the dollar cost of producing each token, combining hardware price and utilization
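These metrics fall out directly from per-token timestamps. A minimal sketch (the function name and trace values are hypothetical, for illustration only):

```python
def inference_metrics(request_start, token_times, price_per_1k_tokens):
    """Derive TTFT, decode TPS, and cost from per-token completion times (seconds)."""
    ttft = token_times[0] - request_start
    decode_duration = token_times[-1] - token_times[0]
    # decode speed: tokens generated after the first, over the decode phase
    tps = (len(token_times) - 1) / decode_duration if decode_duration > 0 else float("inf")
    cost = len(token_times) * price_per_1k_tokens / 1000
    return ttft, tps, cost

# hypothetical trace: first token at 0.25 s, then one token every 50 ms
times = [0.25 + 0.05 * i for i in range(11)]
ttft, tps, cost = inference_metrics(0.0, times, price_per_1k_tokens=0.002)
print(round(ttft, 2), round(tps, 1), cost)  # 0.25  20.0  2.2e-05
```

Note that TTFT and TPS measure different phases: prefill dominates TTFT, while decode speed determines TPS, so optimizations often target one at the expense of the other.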

Serverless inference

Services like AWS Bedrock and provider APIs eliminate infrastructure management entirely: you pay per token consumed, which makes them ideal for variable or spiky workloads.

Why it matters

In production, inference happens millions of times. Every millisecond of latency and every processed token has a cost. Optimization techniques — quantization, KV cache, batching — are the difference between an economically viable AI system and one that breaks the budget.
