Concepts

Prompt Caching

Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.

#seed #prompt-caching #llm #cost-reduction #latency #anthropic #openai #optimization

What it is

Prompt caching is an optimization offered by LLM providers that stores the internal computation (attention states) of prompt prefixes that repeat across API calls. Instead of reprocessing thousands of identical tokens on every request, the model reuses the previous computation and only processes the new tokens.

Unlike traditional software caching — which stores outputs like HTTP responses or query results — prompt caching stores processed inputs, because LLM outputs are dynamic and vary with each generation.

How it works

The process follows three steps:

  1. First call: the provider processes the full prompt and stores the internal state of the prefix
  2. Subsequent calls: the system detects the prefix matches a stored one and reuses the computation
  3. Only the new part: the model processes only the tokens that differ from the cached prefix

The cache has a limited time window — typically 5 to 10 minutes of inactivity before expiring.
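The three steps above can be sketched as a toy model (the class and names here are hypothetical; real providers cache attention key/value states inside the model, not entries in a hash map):

```python
import hashlib
import time

class ToyPrefixCache:
    """Toy model of prompt-prefix caching, for intuition only."""

    def __init__(self, ttl_seconds=300):  # ~5-minute inactivity window
        self.ttl = ttl_seconds
        self.store = {}  # prefix hash -> (cached "state", timestamp)

    def process(self, prefix_tokens, new_tokens):
        """Return how many tokens must actually be processed."""
        key = hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()
        now = time.time()
        entry = self.store.get(key)
        if entry and now - entry[1] < self.ttl:
            # Step 2/3: prefix matches a live entry -> only new tokens
            return len(new_tokens)
        # Step 1: first call (or expired) -> process everything, store prefix
        self.store[key] = ("attention-state", now)
        return len(prefix_tokens) + len(new_tokens)

cache = ToyPrefixCache()
doc = ["tok"] * 1000            # long shared prefix
first = cache.process(doc, ["question", "one"])
second = cache.process(doc, ["question", "two"])
```

Here `first` is 1002 (full prompt) while `second` is 2: only the tokens after the cached prefix are processed.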

Provider implementations

| Provider | Type | Minimum tokens | Discount | Latency |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit — requires marking blocks with `cache_control` | 1,024 | 90% on cached tokens | Up to 85% less |
| OpenAI (GPT-4o, o1) | Automatic — no code changes | 1,024 | 50% on cached tokens | Variable reduction |
| Google (Gemini) | Explicit — requires manual configuration | Variable | Up to 75% | Variable |
| DeepSeek | Automatic | 1,024 | Up to 90% | Variable |

Anthropic example

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # thousands of tokens of shared context
        "cache_control": {"type": "ephemeral"}  # cache the prefix up to here
    }],
    messages=[{"role": "user", "content": question}]
)

The block marked with cache_control is stored after the first call. Subsequent calls with the same prefix pay only the cache read price.
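The pricing asymmetry can be made concrete. A minimal sketch, assuming Anthropic's documented multipliers (cache writes bill at 1.25x the base input price, cache reads at 0.10x) and an illustrative base price:

```python
BASE_PRICE_PER_MTOK = 3.00  # illustrative input price, USD per million tokens

def request_cost(cache_write_tokens, cache_read_tokens, uncached_tokens):
    # Cache writes cost 1.25x base input; cache reads cost 0.10x base input
    return (cache_write_tokens * 1.25
            + cache_read_tokens * 0.10
            + uncached_tokens) * BASE_PRICE_PER_MTOK / 1_000_000

# First call: a 10,000-token prefix is written to the cache
first_call = request_cost(10_000, 0, 50)
# Later calls within the window: the same prefix is read from the cache
later_call = request_cost(0, 10_000, 50)
```

With these numbers the later calls cost roughly a tenth of the first, which is where the "fraction of the price" framing comes from.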

When to use it

Prompt caching is most effective when:

  • Long documents as context: repeated analysis of the same document with different questions
  • Extensive system prompts: complex instructions that repeat on every call
  • Conversations with history: new turns are appended at the end, so the earlier history stays a stable cached prefix
  • Few-shot learning: the same examples are sent on every request
  • Agents with tools: tool definitions repeat on every iteration of the agent loop
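For the agent case, Anthropic's explicit API lets you mark tool definitions for caching. The `cache_control` placement below follows the documented API shape, but the tool names and schemas are made up for illustration:

```python
# Tool definitions repeat on every iteration of the agent loop,
# so they are a natural caching target (hypothetical tools).
tools = [
    {"name": "search", "description": "Search the web",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}}}},
    {"name": "read_file", "description": "Read a local file",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}}}},
]
# Caching is prefix-based, so the marker goes on the LAST block to be
# cached: everything up to and including it becomes the cached prefix.
tools[-1]["cache_control"] = {"type": "ephemeral"}
```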

Considerations

  • Order matters: caching works by prefix matching — changing block order invalidates the cache
  • Write cost: the first cache write can cost more than a normal request (25% more on Anthropic)
  • Expiration window: the cache expires after minutes of inactivity
  • Not semantic: requires exact token matching, not meaning similarity
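The first and last points can be seen in a toy prefix check, where hashing stands in for the provider's exact token-prefix match:

```python
import hashlib

def prefix_key(blocks):
    # Stand-in for exact prefix matching: any change in the content OR
    # the order of the blocks yields a different key, i.e. a cache miss.
    return hashlib.sha256("\x00".join(blocks).encode()).hexdigest()

same = prefix_key(["system prompt", "tool definitions", "document"])
again = prefix_key(["system prompt", "tool definitions", "document"])
reordered = prefix_key(["tool definitions", "system prompt", "document"])
paraphrased = prefix_key(["system prompt", "tool definitions", "the document"])
```

`same == again` (cache hit), but both reordering the blocks and paraphrasing one of them produce different keys: matching is positional and literal, never semantic.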

Why it matters

In AI applications with repetitive context — agents, RAG, document analysis — input token cost dominates the bill. Prompt caching turns a linear expense into a nearly constant one: the first request pays full price, but subsequent ones pay a fraction. For an agent that iterates 10 times with the same system prompt and tools, the difference can be 10x in cost.
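A back-of-envelope check of that claim, with assumed numbers (an 8,000-token shared prefix, 200 new tokens per iteration, and Anthropic's 1.25x write / 0.10x read multipliers):

```python
PREFIX, NEW, ITERS = 8_000, 200, 10  # assumed sizes, in tokens

# Without caching, every iteration pays for the full prefix at base price
without_cache = ITERS * (PREFIX + NEW)

# With caching, iteration 1 writes the prefix (1.25x) and the other
# nine read it (0.10x); the new tokens always bill at base price
with_cache = (PREFIX * 1.25 + NEW) + (ITERS - 1) * (PREFIX * 0.10 + NEW)

ratio = without_cache / with_cache
```

With these numbers the saving is about 4x in billed token-cost units; it climbs toward the 10x figure as the iteration count and the prefix-to-new-token ratio grow.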
