Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
Prompt caching is an optimization offered by LLM providers that stores the internal computation (attention states) of prompt prefixes that repeat across API calls. Instead of reprocessing thousands of identical tokens on every request, the model reuses the previous computation and only processes the new tokens.
Unlike traditional software caching — which stores outputs like HTTP responses or query results — prompt caching stores processed inputs, because LLM outputs are dynamic and vary with each generation.
The process follows three steps:

1. On the first request, the provider processes the full prompt and stores the attention states of the cacheable prefix (marked explicitly or detected automatically, depending on the provider).
2. On subsequent requests whose prefix matches exactly, the stored computation is reused and only the new tokens are processed, billed at the discounted cache-read rate.
3. If the prefix stops being reused, the entry expires: the cache has a limited time window, typically 5 to 10 minutes of inactivity.
| Provider | Cache type | Minimum prefix (tokens) | Discount on cached tokens | Latency reduction |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit — requires marking blocks with cache_control | 1,024 | 90% | Up to 85% |
| OpenAI (GPT-4o, o1) | Automatic — no code changes | 1,024 | 50% | Variable |
| Google (Gemini) | Explicit — requires manual configuration | Variable | Up to 75% | Variable |
| DeepSeek | Automatic | 1,024 | Up to 90% | Variable |
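With OpenAI, no marking is needed: a prefix of 1,024 or more tokens that repeats across requests is cached automatically. A minimal sketch of verifying that a cache hit occurred, assuming the openai Python SDK and its usage.prompt_tokens_details.cached_tokens field (worth confirming against the current SDK version):

```python
from openai import OpenAI

client = OpenAI()

long_system_prompt = "..."  # thousands of tokens of stable instructions
question = "..."            # the part that changes between calls

# No cache_control needed: a repeated prefix of 1,024+ tokens
# is cached automatically on OpenAI's side.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix
        {"role": "user", "content": question},              # variable suffix
    ],
)

# On a cache hit, cached_tokens reports how many prompt tokens
# were read from the cache instead of being reprocessed.
print(response.usage.prompt_tokens_details.cached_tokens)
```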
With Anthropic, caching is explicit: the blocks to reuse are marked with cache_control.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Thousands of tokens of stable context
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": question}]
)
```

The block marked with cache_control is stored after the first call. Subsequent calls with the same prefix pay only the cache read price.
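To confirm the cache is actually being hit, inspect the usage block of the response. A short sketch assuming the cache_creation_input_tokens and cache_read_input_tokens fields reported by the Anthropic Python SDK:

```python
# On the first call the prefix is written to the cache;
# on later calls with the same prefix it is read back.
usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache
print(usage.cache_read_input_tokens)      # tokens served from the cache
print(usage.input_tokens)                 # uncached tokens billed at full price
```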
Prompt caching is most effective when:

- The prompt has a long, stable prefix (system prompt, tool definitions, reference documents) above the provider's minimum token threshold.
- The same prefix is reused across many calls within the cache's expiration window, as in agent loops, multi-turn conversations, or repeated questions over the same document.
- The variable content (the user question, the latest turn) comes after the stable prefix, since only exact prefix matches are reused.
For an agent that iterates 10 times with a 4,000-token system prompt and 2,000-token tool definitions:
| Scenario | Billed input tokens (cost-equivalent) | Relative cost |
|---|---|---|
| No cache | 10 × 6,000 = 60,000 | 100% |
| With cache (Anthropic, 90% discount) | 6,000 + 9 × 600 = 11,400 | ~19% |
| With cache (OpenAI, 50% discount) | 6,000 + 9 × 3,000 = 33,000 | ~55% |
Savings scale with the number of iterations and prefix size. In RAG pipelines where the same document is analyzed with multiple questions, the pattern is identical.
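A small sketch of the arithmetic behind the table, using the discounts listed earlier and ignoring the cache-write surcharge some providers add to the first call:

```python
def cached_input_cost(prefix_tokens: int, iterations: int, cache_discount: float) -> float:
    """Cost-equivalent input tokens when a repeated prefix is cached.

    The first call processes the prefix at full price; the remaining
    iterations - 1 calls read it from the cache at (1 - cache_discount)
    of the normal input price.
    """
    first_call = prefix_tokens
    cached_calls = (iterations - 1) * prefix_tokens * (1 - cache_discount)
    return first_call + cached_calls

prefix = 4_000 + 2_000                                  # system prompt + tool definitions
no_cache = 10 * prefix                                  # 60,000
anthropic_cost = cached_input_cost(prefix, 10, 0.90)    # 11,400
openai_cost = cached_input_cost(prefix, 10, 0.50)       # 33,000

print(f"Anthropic: {anthropic_cost / no_cache:.0%}, OpenAI: {openai_cost / no_cache:.0%}")
# Anthropic: 19%, OpenAI: 55%
```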
In AI applications with repetitive context — agents, RAG, document analysis — input token cost dominates the bill. Prompt caching turns a linear expense into a nearly constant one: the first request pays full price, and subsequent ones pay a fraction. For the 10-iteration agent above the difference is roughly 5x, and it approaches the full 10x implied by a 90% discount as the number of iterations grows.