Jonatan Mata · jonmatum.com
© 2026 Jonatan Mata. All rights reserved.
Concepts

Prompt Caching

Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.

evergreen · #prompt-caching · #llm · #cost-reduction · #latency · #anthropic · #openai · #optimization

What it is

Prompt caching is an optimization offered by LLM providers that stores the internal computation (attention states) of prompt prefixes that repeat across API calls. Instead of reprocessing thousands of identical tokens on every request, the model reuses the previous computation and only processes the new tokens.

Unlike traditional software caching — which stores outputs like HTTP responses or query results — prompt caching stores processed inputs, because LLM outputs are dynamic and vary with each generation.

How it works

The process follows three steps:

  1. First call: the provider processes the full prompt and stores the internal state of the prefix
  2. Subsequent calls: the system detects the prefix matches a stored one and reuses the computation
  3. Only the new part: the model processes only the tokens that differ from the cached prefix

The cache has a limited time window — typically 5 to 10 minutes of inactivity before expiring.
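The three steps above can be sketched as a toy in-memory cache keyed by a hash of the prompt prefix. This is illustrative only: real providers cache the model's attention states internally, and the 5-minute TTL here is an assumption matching typical provider defaults.

```python
import hashlib
import time

CACHE_TTL = 5 * 60  # seconds of inactivity before an entry expires (typical window)
_cache = {}         # prefix hash -> (stand-in for internal state, last access time)

def process(prompt: str, prefix_len: int) -> str:
    """Process a prompt, reusing the cached prefix computation when possible."""
    prefix, suffix = prompt[:prefix_len], prompt[prefix_len:]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    now = time.monotonic()
    entry = _cache.get(key)
    if entry and now - entry[1] < CACHE_TTL:
        state = entry[0]                       # cache hit: skip recomputing the prefix
        billed = len(suffix)                   # only the new tokens are fully billed
    else:
        state = f"state({len(prefix)} chars)"  # cache miss: compute and store
        billed = len(prompt)
    _cache[key] = (state, now)
    return f"billed {billed} chars"

doc = "A" * 1000
print(process(doc + "Q1?", 1000))  # first call: full prompt processed
print(process(doc + "Q2?", 1000))  # same prefix: only the new question is processed
```

The second call bills only the 3 characters of the new question, mirroring how providers charge full price for the first request and a fraction afterwards.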

Provider implementations

| Provider | Type | Minimum tokens | Discount | Latency |
| --- | --- | --- | --- | --- |
| Anthropic (Claude) | Explicit: blocks marked with cache_control | 1,024 | 90% on cached tokens | Up to 85% less |
| OpenAI (GPT-4o, o1) | Automatic: no code changes | 1,024 | 50% on cached tokens | Variable reduction |
| Google (Gemini) | Explicit: requires manual configuration | Variable | Up to 75% | Variable |
| DeepSeek | Automatic | 1,024 | Up to 90% | Variable |

Anthropic example

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Thousands of tokens
        "cache_control": {"type": "ephemeral"}  # mark this block as cacheable
    }],
    messages=[{"role": "user", "content": question}]
)

The block marked with cache_control is stored after the first call. Subsequent calls with the same prefix pay only the cache read price.

When to use it

Prompt caching is most effective when:

  • Long documents as context: repeated analysis of the same document with different questions
  • Extensive system prompts: complex instructions that repeat on every call
  • Conversations with history: the history grows but the prefix remains
  • Few-shot learning: the same examples are sent on every request
  • Agents with tools: tool definitions repeat on every iteration of the agent loop
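For the agent case, one way to keep the prefix stable across iterations is to build the static parts once and only append the growing conversation. A sketch, assuming an Anthropic-style request shape; the system text and the run_query tool are placeholders, and the model name comes from the example above:

```python
# Static blocks: byte-identical on every iteration, so the provider can cache them.
SYSTEM = [{"type": "text", "text": "You are a data-analysis agent.",
           "cache_control": {"type": "ephemeral"}}]
TOOLS = [{"name": "run_query", "description": "Run a SQL query",
          "input_schema": {"type": "object",
                           "properties": {"sql": {"type": "string"}}}}]

def build_request(history: list) -> dict:
    """Assemble a request whose cacheable prefix (system + tools) never changes."""
    return {"model": "claude-sonnet-4-20250514", "max_tokens": 1024,
            "system": SYSTEM, "tools": TOOLS, "messages": history}

history = [{"role": "user", "content": "Total sales per region?"}]
r1 = build_request(history)
history = history + [{"role": "assistant", "content": "Querying..."},
                     {"role": "user", "content": "Now break it down by month."}]
r2 = build_request(history)

# The prefix is identical across iterations; only the appended messages are new.
assert (r1["system"], r1["tools"]) == (r2["system"], r2["tools"])
```

Keeping the static blocks in module-level constants, rather than rebuilding them per call, avoids accidental reordering or reformatting that would break the exact-match prefix.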

Cost calculation

For an agent that iterates 10 times with a 4,000-token system prompt and 2,000-token tool definitions:

| Scenario | Input tokens processed | Relative cost |
| --- | --- | --- |
| No cache | 10 × 6,000 = 60,000 | 100% |
| With cache (Anthropic) | 6,000 + 9 × 600 = 11,400 | ~19% |
| With cache (OpenAI) | 6,000 + 9 × 3,000 = 33,000 | ~55% |

Savings scale with the number of iterations and prefix size. In RAG pipelines where the same document is analyzed with multiple questions, the pattern is identical.
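The arithmetic in the table generalizes to a small helper: full price on the first call, then the discounted prefix plus any new suffix tokens on each later call. A minimal sketch that ignores the first-call cache-write surcharge:

```python
def input_tokens(prefix: int, suffix: int, iterations: int, discount: float) -> float:
    """Effective input tokens billed across an agent's iterations.

    prefix:   cached tokens (system prompt + tool definitions)
    suffix:   new tokens added per iteration
    discount: fraction of the prefix price waived on cache hits (0.90 = 90% off)
    """
    first = prefix + suffix                                   # full price once
    rest = (iterations - 1) * (prefix * (1 - discount) + suffix)
    return first + rest

prefix = 6_000  # 4,000-token system prompt + 2,000-token tool definitions
print(input_tokens(prefix, 0, 10, 0.90))  # Anthropic: 11,400
print(input_tokens(prefix, 0, 10, 0.50))  # OpenAI:    33,000
print(input_tokens(prefix, 0, 10, 0.0))   # no cache:  60,000
```

The three calls reproduce the table's rows; plugging in a nonzero suffix shows how per-turn user messages erode the savings slightly.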

Considerations

  • Order matters: caching works by prefix matching — changing block order invalidates the cache
  • Write cost: the first cache write can cost more than a normal request (25% more on Anthropic)
  • Expiration window: the cache expires after minutes of inactivity
  • Not semantic: requires exact token matching, not meaning similarity
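Because matching is by exact prefix, reordering otherwise-identical blocks destroys the shared prefix entirely, as this toy comparison over stand-in token sequences shows:

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the longest common prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

doc = ["doc"] * 1000          # stand-in for a long cached document
instructions = ["sys"] * 50   # stand-in for a system prompt

call_1 = instructions + doc + ["Q1"]
call_2 = instructions + doc + ["Q2"]   # same order: 1,050-token prefix is reusable
call_3 = doc + instructions + ["Q1"]   # swapped order: shared prefix collapses

print(shared_prefix_len(call_1, call_2))  # 1050
print(shared_prefix_len(call_1, call_3))  # 0
```

Swapping the document and the instructions changes the very first token, so the cache sees an entirely new prompt even though the content is identical.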

Why it matters

In AI applications with repetitive context — agents, RAG, document analysis — input token cost dominates the bill. Prompt caching turns a linear expense into a nearly constant one: the first request pays full price, but subsequent ones pay a fraction. For an agent that iterates 10 times with the same system prompt and tools, the difference can exceed 5x in cost, as the table above shows.

References

  • Prompt Caching — Anthropic — Anthropic, 2024. Prompt caching implementation in Claude with cache_control examples.
  • Prompt Caching 101 — OpenAI Cookbook — OpenAI, 2024. Practical guide with automatic caching examples.
  • Prompt Caching: Cost & Performance Analysis — Artificial Analysis. Cross-provider cost and latency comparison.
  • Prompt Caching — Amazon Bedrock — AWS, 2024. Prompt caching documentation for Bedrock.
  • Context Caching — Gemini API — Google, 2024. Context caching implementation in Gemini.

Related content

  • Inference Optimization

    Techniques to reduce cost, latency, and resources needed to run language models in production, from quantization to distributed serving.

  • Context Windows

    The maximum number of tokens an LLM can process in a single interaction, determining how much information it can consider simultaneously to generate responses.

  • Cost Optimization

    Practices and strategies to minimize cloud spending without sacrificing performance, including right-sizing, reservations, spot instances, and eliminating idle resources.

  • Large Language Models

    Massive neural networks based on the Transformer architecture, trained on enormous text corpora to understand and generate natural language with emergent capabilities like reasoning, translation, and code generation.

  • Agentic Workflows

    Design patterns where AI agents execute complex multi-step tasks autonomously, combining reasoning, tool use, and iterative decision-making.
