Throughput
How many requests or tokens an agent system can handle per unit of time. The metric that matters for batch workloads and high-volume APIs.
Last updated: April 26, 2026
Definition
Throughput measures the volume of work completed per unit of time: requests per second, tokens per second, or completed agent runs per minute. It matters when you serve many users at once (chat at scale), when you process large batches (overnight document analysis), or when you regularly hit your provider's rate limits. Throughput optimization differs from latency optimization: throughput favors batching, parallel requests, and smaller models, while latency favors streaming and caching. The right balance depends on the workload. Production systems usually optimize for latency on user-facing paths and for throughput on batch backends.
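To make the throughput versus latency distinction concrete, here is a minimal sketch that runs the same simulated workload with one request in flight and then with ten. Everything in it is an assumption for illustration: `fake_request` stands in for a real model or API call, the 10 ms latency and the 100-token response are made-up numbers, and `run_batch` and `compare` are hypothetical helper names. Per-request latency stays the same in both runs; only the number of requests completed per second changes.

```python
import asyncio
import time


async def fake_request(latency_s: float) -> int:
    """Stand-in for a model or API call; returns a made-up token count."""
    await asyncio.sleep(latency_s)
    return 100  # assumed tokens per response, purely illustrative


async def run_batch(n_requests: int, concurrency: int, latency_s: float = 0.01) -> dict:
    """Run n_requests with at most `concurrency` in flight and report throughput."""
    limit = asyncio.Semaphore(concurrency)

    async def bounded() -> int:
        # The semaphore caps how many requests run at the same time.
        async with limit:
            return await fake_request(latency_s)

    start = time.perf_counter()
    tokens = await asyncio.gather(*(bounded() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    return {
        "requests": len(tokens),
        "requests_per_s": len(tokens) / elapsed,
        "tokens_per_s": sum(tokens) / elapsed,
    }


def compare(n_requests: int = 20) -> tuple[dict, dict]:
    """Same workload, one request at a time versus ten in parallel."""
    sequential = asyncio.run(run_batch(n_requests, concurrency=1))
    parallel = asyncio.run(run_batch(n_requests, concurrency=10))
    return sequential, parallel


if __name__ == "__main__":
    seq, par = compare()
    print(f"sequential: {seq['requests_per_s']:.0f} req/s")
    print(f"parallel:   {par['requests_per_s']:.0f} req/s")
```

Against a real provider the concurrency cap would be set from its rate limits rather than picked arbitrarily, and the gain from parallelism would flatten once those limits are reached.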
When To Use
Track throughput on any non-interactive workload, and on interactive ones once concurrent users or provider rate limits become the constraint. The right unit depends on the use case: requests/second for APIs, tokens/second for batch, runs/hour for scheduled agents.
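The three units are the same calculation with different counters and time scales. A small sketch, assuming you already collect raw counts over a measurement window (the `WindowStats` record and `throughput` function are hypothetical names, not part of any library):

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts observed over one measurement window (hypothetical record shape)."""
    requests: int
    tokens: int
    completed_runs: int
    window_seconds: float


def throughput(stats: WindowStats) -> dict:
    """Convert raw counts into the three units named above."""
    if stats.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return {
        "requests_per_second": stats.requests / stats.window_seconds,
        "tokens_per_second": stats.tokens / stats.window_seconds,
        # 3600 seconds per hour scales the per-second rate up to an hourly one.
        "runs_per_hour": stats.completed_runs * 3600 / stats.window_seconds,
    }


# Example: a 10-minute window with made-up counts.
stats = WindowStats(requests=1200, tokens=480_000, completed_runs=30, window_seconds=600)
print(throughput(stats))
# {'requests_per_second': 2.0, 'tokens_per_second': 800.0, 'runs_per_hour': 180.0}
```

Reporting all three from the same window makes it easier to compare an API path, a batch job, and a scheduled agent on one dashboard.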