Jahanzaib
Memory & Context

Prompt Caching

Reusing previously processed input tokens at roughly a 90 percent cost discount and with lower latency.

Last updated: April 26, 2026

Definition

Prompt caching lets you mark a prefix of your prompt as cacheable. On subsequent calls within the cache's lifetime (about 5 minutes for Anthropic's ephemeral cache), the model reads the cached tokens instead of reprocessing them. Cached input tokens cost roughly 90 percent less, and latency drops because the model skips the prefill step on the cached portion. Best fit: RAG agents that resend the same system prompt and retrieved context across many tool-calling iterations. Real-world savings: 30 to 70 percent of total cost for typical agent workloads.
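The savings claims above are easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, using illustrative rates (base input price, a 1.25x cache-write premium, and a 0.10x cache-read rate, roughly Anthropic's published multipliers; check your provider's current pricing):

```python
# Break-even arithmetic for prompt caching (illustrative prices).
BASE = 3.00          # $/MTok, assumed base input rate
WRITE = 1.25 * BASE  # first call pays a premium to populate the cache
READ = 0.10 * BASE   # later calls read cached tokens at ~90% discount

def cost_cached(prefix_mtok: float, calls: int) -> float:
    """Total input cost for `calls` requests sharing one cached prefix."""
    return prefix_mtok * (WRITE + (calls - 1) * READ)

def cost_uncached(prefix_mtok: float, calls: int) -> float:
    """Same workload with no caching: full price on every call."""
    return prefix_mtok * BASE * calls

# A 50k-token prefix reused across 10 agent iterations:
cached = cost_cached(0.05, 10)       # 0.05 * (3.75 + 9 * 0.30) = 0.3225
uncached = cost_uncached(0.05, 10)   # 0.05 * 3.00 * 10 = 1.50
savings = 1 - cached / uncached      # ~78% saved on the prefix
```

With ten reuses the prefix cost drops by roughly 78 percent; the more iterations share the prefix, the closer savings approach the full 90 percent discount.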

Code Example

```python
# Anthropic prompt caching: mark the end of the stable prefix
response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=conversation,
)
```

Add cache_control to the last block of the prefix you want cached; everything up to and including that block is cached. On later calls the API automatically matches the longest previously cached prefix. Note there is a minimum cacheable prefix length (1,024 tokens on most Claude models), so very short prompts won't cache.

When To Use

Turn it on for any agent where the system prompt + tools + retrieved context is large and reused. It is nearly a pure cost win: cache writes cost about 25 percent more than base input, so you break even after the first cache hit. There is no quality tradeoff.
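It's worth verifying that caching is actually firing. Anthropic's API reports cache activity in the response's usage object (cache_creation_input_tokens and cache_read_input_tokens alongside input_tokens). A small sketch with a hypothetical helper over those usage fields:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of this call's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)  # uncached tokens this call
    total = read + write + fresh
    return read / total if total else 0.0

# First call writes the cache (hit rate 0); later calls should read it:
first = {"input_tokens": 40, "cache_creation_input_tokens": 5000,
         "cache_read_input_tokens": 0}
later = {"input_tokens": 60, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 5000}
```

If later calls keep showing cache_creation instead of cache_read, the prefix is changing between calls (timestamps, shuffled tool order) or requests are arriving after the cache's ~5-minute TTL expires.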

Building with Prompt Caching?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.