Jahanzaib
Models & Training

Throughput

How many requests or tokens an agent system can handle per unit of time. The metric that matters for batch workloads and high-volume APIs.

Last updated: April 26, 2026

Definition

Throughput measures volume: requests per second, tokens per second, or completed agent runs per minute. It matters when you have many users at once (chat at scale), when you batch process at scale (overnight document analysis), or when your provider has rate limits you bump against. Throughput optimization is different from latency optimization: it favors batching, parallel requests, and smaller models, while latency optimization favors streaming and smart caching. The right balance depends on workload. Production systems usually optimize for latency on user-facing paths and throughput on batch backends.

When To Use

Track throughput on any non-interactive workload. The right unit depends on the use case: requests/second for APIs, tokens/second for batch, runs/hour for scheduled agents.

Sources

Related Terms

Building with Throughput?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.