Prompt Caching
Reuse previously processed input tokens for roughly a 90 percent cost discount and lower latency.
Last updated: April 26, 2026
Definition
Prompt caching lets you mark a prefix of your prompt as "cacheable." On subsequent calls within the cache's lifetime (~5 minutes for Anthropic's default), the model reads the cached tokens instead of reprocessing them. Cached input tokens cost roughly 90 percent less, and latency drops because the model skips prefill on the cached portion. Best fit: RAG agents that resend the same system prompt, tool definitions, and retrieved context across many tool-calling iterations. Real-world savings: 30 to 70 percent of total cost for typical agent workloads.
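The savings arithmetic is easy to sketch. A minimal back-of-envelope model, using illustrative rate multipliers (cache writes billed at roughly 1.25x the base input price and cache reads at roughly 0.10x; check your provider's pricing page for exact figures):

```python
# Back-of-envelope cost model for prompt caching.
# write_mult and read_mult are illustrative assumptions, not official rates.
def caching_savings(prefix_tokens, variable_tokens, num_calls,
                    write_mult=1.25, read_mult=0.10):
    """Fractional input-token cost savings vs. sending the full prompt each call."""
    baseline = num_calls * (prefix_tokens + variable_tokens)
    cached = (write_mult * prefix_tokens                     # first call writes the cache
              + (num_calls - 1) * read_mult * prefix_tokens  # later calls read it
              + num_calls * variable_tokens)                 # variable suffix always full price
    return 1 - cached / baseline

# e.g. 20k-token cached prefix, 5k variable tokens per turn, 10 agent turns
print(round(caching_savings(20_000, 5_000, 10), 2))
```

Note that a single call actually costs slightly more under this model (the write premium with no subsequent reads), which is why caching pays off for agents, not one-shot prompts.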
Code Example
# Anthropic prompt caching
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=conversation,
)

Add cache_control to a content block to set a cache breakpoint: everything up to and including that block becomes the cached prefix, and subsequent calls with an identical prefix read it from cache.
When To Use
Turn it on for any agent where the system prompt, tool definitions, and retrieved context are large and reused across calls. No quality tradeoff; the only overhead is a small premium on the first call that writes the cache, so one-off prompts see no benefit.
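In an agent, the static prefix usually includes tool definitions as well as the system prompt. A sketch of building such a request, with cache breakpoints on both (the tool name, schema, and helper function here are hypothetical, not from the Anthropic SDK):

```python
# Sketch: mark tool definitions and the system prompt cacheable in an
# Anthropic-style request body. build_request is a hypothetical helper.
def build_request(system_prompt, tools, conversation):
    """Attach cache_control to the last tool and the system block so the
    whole static prefix (tools + system prompt) is cached together."""
    tools = [dict(t) for t in tools]
    tools[-1]["cache_control"] = {"type": "ephemeral"}  # breakpoint after tools
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "tools": tools,
        "system": [{"type": "text", "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": conversation,
    }

req = build_request(
    "You are a retrieval agent.",
    [{"name": "search_docs", "description": "Search the corpus.",
      "input_schema": {"type": "object", "properties": {}}}],
    [{"role": "user", "content": "Find the refund policy."}],
)
```

Because each agent turn appends only to messages, the cached tools-plus-system prefix stays byte-identical and keeps hitting the cache.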
Building with Prompt Caching?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.