Prompt Caching
Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
What it is
Prompt caching is an optimization offered by LLM providers that stores the internal computation (attention states) of prompt prefixes that repeat across API calls. Instead of reprocessing thousands of identical tokens on every request, the model reuses the previous computation and only processes the new tokens.
Unlike traditional software caching — which stores outputs like HTTP responses or query results — prompt caching stores processed inputs, because LLM outputs are dynamic and vary with each generation.
How it works
The process follows three steps:
- First call: the provider processes the full prompt and stores the internal state of the designated prefix
- Subsequent calls: the system detects that the prompt begins with a stored prefix and reuses its computation
- New tokens only: the model processes only the tokens that follow the cached prefix
The cache has a limited time window — typically 5 to 10 minutes of inactivity before expiring.
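The steps above can be simulated with a toy in-memory cache. This is an illustration only (the class name is ours, and characters stand in for tokens); real providers cache attention states server-side:

```python
import time

class PrefixCache:
    """Toy model of provider-side prompt caching, for illustration only."""

    def __init__(self, ttl_seconds=300):  # roughly the 5-minute window
        self.ttl = ttl_seconds
        self.store = {}  # prefix -> (stored "state", last-used time)

    def process(self, prompt, prefix_len):
        prefix, suffix = prompt[:prefix_len], prompt[prefix_len:]
        now = time.monotonic()
        entry = self.store.get(prefix)
        if entry is not None and now - entry[1] < self.ttl:
            # Hit: reuse the prefix computation, pay only for the suffix
            self.store[prefix] = (entry[0], now)  # activity refreshes the TTL
            return {"cached": len(prefix), "processed": len(suffix)}
        # Miss (or expired): process everything and store the prefix state
        self.store[prefix] = ("state", now)
        return {"cached": 0, "processed": len(prompt)}
```

The first call reports `cached: 0`; a second call within the window on the same prefix reports `cached: prefix_len` and processes only the new suffix.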
Provider implementations
| Provider | Type | Minimum tokens | Discount | Latency |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit — requires marking blocks with cache_control | 1,024 | 90% on cached tokens | Up to 85% less |
| OpenAI (GPT-4o, o1) | Automatic — no code changes | 1,024 | 50% on cached tokens | Variable reduction |
| Google (Gemini) | Explicit — requires manual configuration | Variable | Up to 75% | Variable |
| DeepSeek | Automatic | 1,024 | Up to 90% | Variable |
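With automatic caching (OpenAI, DeepSeek) the only lever is request construction: since matching is by exact token prefix, stable content must come first and per-request content last. A minimal sketch (the helper name is ours):

```python
def build_messages(system_prompt, examples, history, question):
    """Order messages so the stable prefix stays identical across calls.

    Automatic caches match on the exact token prefix, so anything that
    changes per request must come after everything that does not.
    """
    messages = [{"role": "system", "content": system_prompt}]  # most stable
    messages += examples   # few-shot pairs, identical on every call
    messages += history    # grows at the end, so the prefix stays intact
    messages.append({"role": "user", "content": question})  # most volatile
    return messages
```

The same ordering principle applies to explicit caches: breakpoints only pay off on content that precedes everything variable.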
Anthropic example
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # thousands of tokens of shared context
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": question}],
)
```

The block marked with `cache_control` is stored after the first call. Subsequent calls with the same prefix pay only the cache read price.
When to use it
Prompt caching is most effective when:
- Long documents as context: repeated analysis of the same document with different questions
- Extensive system prompts: complex instructions that repeat on every call
- Conversations with history: the history grows but the prefix remains
- Few-shot learning: the same examples are sent on every request
- Agents with tools: tool definitions repeat on every iteration of the agent loop
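For the agent case, Anthropic's explicit API lets a single `cache_control` breakpoint on the last tool cover the entire tool array, since everything before a breakpoint is cached as one prefix. A sketch with hypothetical tool names:

```python
# Marking only the last tool caches all tool definitions as part of the
# prompt prefix (Anthropic explicit caching; tool names are hypothetical).
tools = [
    {
        "name": "search_docs",
        "description": "Search the documentation corpus.",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}}},
        # Breakpoint here: everything up to and including this tool is cached
        "cache_control": {"type": "ephemeral"},
    },
]
```

Passing `tools=tools` to `client.messages.create(...)` then lets every iteration of the agent loop read the definitions from cache instead of reprocessing them.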
Considerations
- Order matters: caching works by prefix matching — changing block order invalidates the cache
- Write cost: the first cache write can cost more than a normal request (25% more on Anthropic)
- Expiration window: the cache expires after minutes of inactivity
- Not semantic: requires exact token matching, not meaning similarity
Why it matters
In AI applications with repetitive context — agents, RAG, document analysis — input token cost dominates the bill. Prompt caching turns a linear expense into a nearly constant one: the first request pays full price, but subsequent ones pay a fraction. For an agent that iterates 10 times with the same system prompt and tools, the difference can be 10x in cost.
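The arithmetic behind that claim, using Anthropic's published multipliers (cache write 1.25x, cache read 0.10x the base input price) and ignoring output tokens:

```python
def relative_input_cost(n_calls, write_mult=1.25, read_mult=0.10):
    """Input cost of n_calls sharing one cached prefix, vs. no caching."""
    uncached = n_calls * 1.0                         # full price every call
    cached = write_mult + (n_calls - 1) * read_mult  # one write, then reads
    return cached / uncached

print(round(relative_input_cost(10), 3))  # → 0.215, i.e. ~4.7x cheaper
```

Caching pays off from the second call (1.25 + 0.10 < 2.00), and as iterations grow the ratio approaches the 0.10x read price: a 10x reduction on the shared prefix.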
References
- Prompt Caching — OpenAI. Official OpenAI documentation.
- Prompt Caching: Cost & Performance Analysis — Artificial Analysis. Cross-provider comparison.
- Prompt Caching — Anthropic, 2024. Prompt caching implementation in Claude.