Jahanzaib

Tokenization

How text gets broken into the discrete units (tokens) that LLMs process. The unit of both pricing and context-window measurement.

Last updated: April 26, 2026

Definition

Tokenization is the process of converting text into the integer IDs an LLM actually sees. Modern models use byte-pair encoding (BPE) variants that split text into sub-word units: common words become a single token, rare words split into multiple tokens, and code and emoji often take more tokens than equivalent prose. A rough rule of thumb: 1 token ≈ 4 characters of English ≈ 0.75 words. The exact count varies by model: Claude, GPT, and Gemini each use a different tokenizer, so the same text yields slightly different token counts on each. Because pricing and context-window limits are both measured in tokens, knowing how your text tokenizes matters for cost and latency planning.
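The BPE idea can be sketched in a few lines of Python. This is a toy illustration of the core mechanism, greedily merging the most frequent adjacent pair of symbols, not any production tokenizer: real tokenizers learn their merge table from a large training corpus and typically operate on UTF-8 bytes rather than characters.

```python
from collections import Counter

def bpe_merge_steps(word, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Starts from individual characters; each merge fuses the most common
    adjacent pair into a single symbol, so frequent substrings ("low")
    collapse into one token while rare tails ("e", "r") stay split.
    """
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            # Fuse every occurrence of the winning pair in one pass.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merge_steps("lowlowlower", 2))
```

After two merges the repeated substring "low" has become a single symbol while the rare suffix remains character-level, which is exactly the common-word-vs-rare-word behavior described above.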

Two practical takeaways. First, use your provider's official tokenizer when estimating cost or context usage; rough character-based estimates can be off by 20 to 30 percent for code or non-English text. OpenAI ships tiktoken; Anthropic ships @anthropic-ai/tokenizer. Second, non-English text tokenizes less efficiently. Mandarin, Arabic, and many other scripts use 2 to 4x more tokens per character than English, so your effective context window is smaller and your costs higher in those languages.
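When the official tokenizer is not available (e.g. for a quick pre-flight check), a character-based estimate with a per-language correction is a common fallback. A minimal sketch: the 4-chars-per-token ratio and the language multipliers here are illustrative assumptions, not official tokenizer output, and the price is a made-up example value.

```python
CHARS_PER_TOKEN = 4.0  # rough English average; varies by model
LANG_MULTIPLIER = {"en": 1.0, "zh": 3.0, "ar": 2.5}  # assumed values, not measured

def estimate_tokens(text: str, lang: str = "en") -> int:
    """Rough token estimate from character count.

    Prefer the provider's official tokenizer (tiktoken,
    @anthropic-ai/tokenizer) for real counts; this can be off
    by 20-30% for code or non-English text.
    """
    base = len(text) / CHARS_PER_TOKEN
    return max(1, round(base * LANG_MULTIPLIER.get(lang, 1.5)))

def estimate_cost_usd(text: str, price_per_million_tokens: float,
                      lang: str = "en") -> float:
    """Estimated input cost at a given per-million-token price."""
    return estimate_tokens(text, lang) * price_per_million_tokens / 1_000_000

# Example with a hypothetical $3 / 1M input tokens price:
print(estimate_cost_usd("hello world" * 500, 3.0))
```

The language multiplier is the part most worth calibrating against real tokenizer output for your actual traffic mix.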

When To Use

Measure tokens, not characters, when budgeting context, predicting cost, or sizing requests. Use the official tokenizer per provider.
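For context budgeting, the check itself is simple once you have token counts. A sketch, assuming a 10% safety reserve for tokenizer-estimate error and system-prompt overhead (the reserve fraction is an assumption, not a provider rule):

```python
def fits_context(prompt_tokens: int, expected_output_tokens: int,
                 context_window: int, reserve: float = 0.10) -> bool:
    """Return True if prompt + expected output fit in the window,
    leaving a safety margin for estimate error and hidden overhead."""
    budget = int(context_window * (1 - reserve))
    return prompt_tokens + expected_output_tokens <= budget

# e.g. 150k-token prompt + 8k expected output in a 200k window:
print(fits_context(150_000, 8_000, 200_000))
```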
