Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.
Inference optimization encompasses techniques to make LLMs faster, cheaper, and more efficient in production. While training happens once, inference happens millions of times — small improvements have massive impact.
Reducing the numerical precision of model weights, for example from 16-bit floats to 8-bit or 4-bit integers, to shrink memory footprint and speed up inference, usually at a small cost in accuracy.
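A minimal NumPy sketch of the core idea (per-tensor symmetric INT8 quantization; the function names are illustrative, not from any specific library):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale factor per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

Production schemes (GPTQ, AWQ, bitsandbytes) are more sophisticated, using per-channel or per-group scales, but the memory saving comes from the same trade: 4x fewer bytes per weight at INT8 versus FP32.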
Storing key-value pairs from attention layers to avoid recalculating them for each generated token. Essential for efficient autoregressive generation.
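A toy single-head sketch of the mechanism (NumPy; names are illustrative): each decode step appends only the new token's key and value to the cache instead of recomputing them for the whole sequence, so a step costs O(t) rather than O(t²).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(q, k_new, v_new):
    """Attend over all past tokens without recomputing their K/V."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = np.stack(k_cache)        # (t, d), reused from the cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)  # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V           # attention output for this step

for _ in range(5):
    out = decode_step(rng.standard_normal(d),
                      rng.standard_normal(d),
                      rng.standard_normal(d))
```

The cost of this speedup is memory: the cache grows linearly with sequence length and batch size, which is why techniques like vLLM's PagedAttention exist to manage it.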
| Framework | Characteristics |
|---|---|
| vLLM | PagedAttention, continuous batching |
| TensorRT-LLM | NVIDIA optimization, high performance |
| Ollama | Local, easy to use |
| llama.cpp | CPU inference, aggressive quantization |
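For illustration, a minimal vLLM offline-inference snippet; the model name is a placeholder, and vLLM applies continuous batching across the prompts internally:

```python
from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face-compatible causal LM works.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Both prompts are batched and scheduled together by the engine.
outputs = llm.generate(["What is a KV cache?", "Explain quantization."], params)
for out in outputs:
    print(out.outputs[0].text)
```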
Managed services like AWS Bedrock and provider APIs eliminate infrastructure management. You pay per token consumed, which makes them ideal for variable workloads.
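A sketch of a pay-per-token call using boto3's Converse API for Bedrock; the model ID and region are placeholder assumptions, and billing is driven by the input and output token counts reported in the response:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model
    messages=[{"role": "user", "content": [{"text": "Summarize KV caching."}]}],
    inferenceConfig={"maxTokens": 256},
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # inputTokens / outputTokens determine the cost
```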
In production, every millisecond of latency and every processed token has a cost. Optimization techniques such as quantization, KV caching, and batching are the difference between an economically viable AI system and one that breaks the budget.
Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.
Cloud computing model in which the provider manages infrastructure automatically, so you can run code without provisioning or managing servers and pay only for actual usage.
Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
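A toy simulation of the idea (the helper names and cache layout are illustrative, not any provider's API): the expensive prefill of a shared prompt prefix is computed once, keyed by a hash, and reused across calls.

```python
import hashlib

# Toy prefix cache keyed by a hash of the shared prompt prefix.
_prefix_cache: dict[str, str] = {}

def encode_prefix(prefix: str) -> str:
    """Stand-in for the expensive prefill computation (the KV state)."""
    return f"kv-state-for-{len(prefix)}-chars"

def generate(prefix: str, question: str) -> str:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:        # cache miss: pay the full prefill
        _prefix_cache[key] = encode_prefix(prefix)
    kv_state = _prefix_cache[key]       # cache hit: reuse the stored state
    return f"answer({question!r}) using {kv_state}"

system_prompt = "You are a support agent. [long policy document...]"
print(generate(system_prompt, "How do I reset my password?"))
print(generate(system_prompt, "What is the refund window?"))  # reuses cache
```

Real providers implement this inside the serving stack (for example as cache markers on prompt segments), but the economics are the same: the long, repeated prefix is processed once and amortized over every subsequent call.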