
Context Window

The maximum number of tokens (input + output) a model can process in a single request.

Last updated: April 26, 2026

Definition

The context window is the model's working memory limit. As of April 2026: Claude Opus/Sonnet 4.6 = 200K tokens, Claude with extended caching = 1M, GPT-5.4 = 256K, Gemini 2.5 Pro = 2M.

Window size matters most for long documents (legal contracts, codebases), multi-turn conversations (chat history grows with every turn), and agent loops (each iteration appends new observations). Exceeding the limit causes either a hard error or silent truncation, depending on the API, so production agents always need a strategy for what to drop when context fills up.
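Because input and output share one window, a practical consequence is that you budget them together: reserve space for the response up front, and spend the remainder on the prompt and history. A minimal sketch, with illustrative numbers (check your provider's current limits):

```python
# Input and output share one window: reserve output space up front.
# Both constants are illustrative, not tied to any specific provider.
CONTEXT_WINDOW = 200_000   # e.g. a 200K-token model
MAX_OUTPUT = 8_000         # tokens reserved for the response

def input_budget(window: int = CONTEXT_WINDOW, reserve: int = MAX_OUTPUT) -> int:
    """Tokens available for prompt + history after reserving output room."""
    return window - reserve

print(input_budget())  # 192000
```

The same arithmetic explains the 180,000 threshold used in the code example below a 200K window minus generous headroom for the model's reply.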

Code Example

python
# Approximate token counts before sending (tokenizers differ across
# models, so treat this as an estimate, not an exact count)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
total = sum(len(enc.encode(m["content"])) for m in messages)
if total > 180_000:  # leave headroom for the response
    messages = compress_old_turns(messages)  # your own compression helper

Always count tokens before sending. Truncation strategies vary by application; prefer compressing old turns over dropping them, and dropping over failing with a hard error.
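The snippet above leaves `compress_old_turns` undefined. One possible shape for it, purely as a sketch: keep the system prompt and the most recent turns, and collapse everything older into a stub (a real version would summarize the dropped turns with an LLM call instead of a placeholder string):

```python
def compress_old_turns(messages, keep_recent=4):
    """Keep the system prompt and the newest turns; collapse older turns
    into a single stub message. Illustrative only -- a production version
    would replace the stub with an LLM-generated summary."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing old enough to compress
    dropped = rest[:-keep_recent]
    stub = {
        "role": "user",
        "content": f"[{len(dropped)} earlier turns summarized/omitted]",
    }
    return system + [stub] + rest[-keep_recent:]
```

The design choice here is that compression is lossy but bounded: the prompt can never grow past the system prompt, one stub, and `keep_recent` turns, which makes the token budget predictable.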

When To Use

Plan for context limits from day one. The cheapest fix is summarizing old turns; the most reliable is a sliding window over recent turns plus a separate long-term memory store.
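The sliding-window-plus-store pattern can be sketched as below. This is a minimal illustration: the in-memory list stands in for a real store (vector DB, SQL), and retrieval is naive substring matching rather than embedding search.

```python
from collections import deque

class SlidingWindowMemory:
    """Sliding window over recent turns plus an append-only archive.
    The archive stands in for a real long-term store; recall() uses
    naive keyword matching purely for illustration."""

    def __init__(self, window_size=6):
        self.recent = deque(maxlen=window_size)  # only these enter the prompt
        self.archive = []                        # everything, for later recall

    def add(self, role, content):
        msg = {"role": role, "content": content}
        self.archive.append(msg)
        self.recent.append(msg)  # deque silently evicts the oldest turn

    def recall(self, query, k=2):
        """Fetch up to k archived turns matching the query."""
        hits = [m for m in self.archive
                if query.lower() in m["content"].lower()]
        return hits[:k]

    def prompt_messages(self, query=None):
        """Recalled long-term memories first, then the recent window."""
        recalled = self.recall(query) if query else []
        return recalled + list(self.recent)
```

The key property: the prompt stays bounded (at most `window_size` recent turns plus `k` recalled ones) no matter how long the conversation runs, while nothing is permanently lost.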


Building with Context Window?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.