Technique that stores the internal computation of reused prompt prefixes across LLM calls, reducing costs by up to 90% and latency by up to 85% in applications with repetitive context.
Prompt caching is an optimization offered by LLM providers that stores the internal computation (attention states) of prompt prefixes that repeat across API calls. Instead of reprocessing thousands of identical tokens on every request, the model reuses the previous computation and only processes the new tokens.
Unlike traditional software caching — which stores outputs like HTTP responses or query results — prompt caching stores processed inputs, because LLM outputs are dynamic and vary with each generation.
The process follows three steps:

1. On the first request, the provider processes the full prompt and stores the attention states of the cacheable prefix (marked explicitly or detected automatically, depending on the provider).
2. On subsequent requests whose prefix matches exactly, the stored computation is reused and only the new tokens are processed, billed at the discounted cache-read rate.
3. If the prefix stops being reused, the entry expires: the cache has a limited time window, typically 5 to 10 minutes of inactivity.
| Provider | Cache type | Minimum prefix (tokens) | Discount on cached tokens | Latency reduction |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit — requires marking blocks with cache_control | 1,024 | 90% | Up to 85% |
| OpenAI (GPT-4o, o1) | Automatic — no code changes | 1,024 | 50% | Variable |
| Google (Gemini) | Explicit — requires manual configuration | Variable | Up to 75% | Variable |
| DeepSeek | Automatic | 1,024 | Up to 90% | Variable |
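With OpenAI, no marking is needed: a prefix of 1,024 or more tokens that repeats across requests is cached automatically. A minimal sketch of verifying that a cache hit occurred, assuming the openai Python SDK and its usage.prompt_tokens_details.cached_tokens field (worth confirming against the current SDK version):

```python
from openai import OpenAI

client = OpenAI()

long_system_prompt = "..."  # thousands of tokens of stable instructions
question = "..."            # the part that changes between calls

# No cache_control needed: a repeated prefix of 1,024+ tokens
# is cached automatically on OpenAI's side.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix
        {"role": "user", "content": question},              # variable suffix
    ],
)

# On a cache hit, cached_tokens reports how many prompt tokens
# were read from the cache instead of being reprocessed.
print(response.usage.prompt_tokens_details.cached_tokens)
```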
With Anthropic, caching is explicit: the blocks to reuse are marked with cache_control.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_document,  # Thousands of tokens of stable context
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": question}]
)
```

The block marked with cache_control is stored after the first call. Subsequent calls with the same prefix pay only the cache read price.
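To confirm the cache is actually being hit, inspect the usage block of the response. A short sketch assuming the cache_creation_input_tokens and cache_read_input_tokens fields reported by the Anthropic Python SDK:

```python
# On the first call the prefix is written to the cache;
# on later calls with the same prefix it is read back.
usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache
print(usage.cache_read_input_tokens)      # tokens served from the cache
print(usage.input_tokens)                 # uncached tokens billed at full price
```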
Prompt caching is most effective when:

- The prompt has a long, stable prefix (system prompt, tool definitions, reference documents) above the provider's minimum token threshold.
- The same prefix is reused across many calls within the cache's expiration window, as in agent loops, multi-turn conversations, or repeated questions over the same document.
- The variable content (the user question, the latest turn) comes after the stable prefix, since only exact prefix matches are reused.
For an agent that iterates 10 times with a 4,000-token system prompt and 2,000-token tool definitions:
| Scenario | Billed input tokens (cost-equivalent) | Relative cost |
|---|---|---|
| No cache | 10 × 6,000 = 60,000 | 100% |
| With cache (Anthropic, 90% discount) | 6,000 + 9 × 600 = 11,400 | ~19% |
| With cache (OpenAI, 50% discount) | 6,000 + 9 × 3,000 = 33,000 | ~55% |
Savings scale with the number of iterations and prefix size. In RAG pipelines where the same document is analyzed with multiple questions, the pattern is identical.
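A small sketch of the arithmetic behind the table, using the discounts listed earlier and ignoring the cache-write surcharge some providers add to the first call:

```python
def cached_input_cost(prefix_tokens: int, iterations: int, cache_discount: float) -> float:
    """Cost-equivalent input tokens when a repeated prefix is cached.

    The first call processes the prefix at full price; the remaining
    iterations - 1 calls read it from the cache at (1 - cache_discount)
    of the normal input price.
    """
    first_call = prefix_tokens
    cached_calls = (iterations - 1) * prefix_tokens * (1 - cache_discount)
    return first_call + cached_calls

prefix = 4_000 + 2_000                                  # system prompt + tool definitions
no_cache = 10 * prefix                                  # 60,000
anthropic_cost = cached_input_cost(prefix, 10, 0.90)    # 11,400
openai_cost = cached_input_cost(prefix, 10, 0.50)       # 33,000

print(f"Anthropic: {anthropic_cost / no_cache:.0%}, OpenAI: {openai_cost / no_cache:.0%}")
# Anthropic: 19%, OpenAI: 55%
```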
In AI applications with repetitive context — agents, RAG, document analysis — input token cost dominates the bill. Prompt caching turns a linear expense into a nearly constant one: the first request pays full price, and subsequent ones pay a fraction. For the 10-iteration agent above the difference is roughly 5x, and it approaches the full 10x implied by a 90% discount as the number of iterations grows.