
Throughput

How many requests or tokens an agent system can handle per unit of time. The metric that matters for batch workloads and high-volume APIs.

Last updated: April 26, 2026

Definition

Throughput measures volume: requests per second, tokens per second, or completed agent runs per minute. It matters when you serve many users at once (chat at scale), when you process large batches (overnight document analysis), or when you run up against your provider's rate limits. Optimizing for throughput differs from optimizing for latency: throughput favors batching, parallel requests, and smaller models, while latency favors streaming and caching. The right balance depends on the workload. Production systems usually optimize for latency on user-facing paths and for throughput on batch backends.
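To make the effect of parallel requests concrete, here is a minimal Python sketch. It assumes a hypothetical `fake_model_call` that only sleeps for a fixed time in place of a real provider request, and the helper names (`run_batch`, `measure`) are illustrative rather than part of any library. Per-request latency stays the same in every run; only the number of requests in flight changes.

```python
import asyncio
import time


async def fake_model_call(latency_s: float) -> str:
    # Hypothetical stand-in for a provider request: it only sleeps.
    await asyncio.sleep(latency_s)
    return "ok"


async def run_batch(num_requests: int, concurrency: int,
                    latency_s: float = 0.02) -> float:
    """Run num_requests calls and return throughput in requests/second."""
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def one_request() -> str:
        async with semaphore:
            return await fake_model_call(latency_s)

    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed


def measure(num_requests: int, concurrency: int) -> float:
    # Synchronous wrapper so the sketch can be called from plain scripts.
    return asyncio.run(run_batch(num_requests, concurrency))


if __name__ == "__main__":
    for concurrency in (1, 4, 16):
        rate = measure(32, concurrency)
        print(f"concurrency={concurrency:>2}: {rate:.1f} req/s")
```

With a real provider, the concurrency cap would normally be set below the provider's rate limit, and throughput stops scaling once that limit or the backend's capacity is reached.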

When To Use

Track throughput on any non-interactive workload. The right unit depends on the use case: requests/second for APIs, tokens/second for batch, runs/hour for scheduled agents.
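As a rough illustration of the three units, the sketch below converts raw counts from one measurement window into requests/second, tokens/second, and runs/hour. The `WindowCounts` type, its field names, and the sample numbers are assumptions made for this example, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    # Hypothetical counters collected over one measurement window.
    requests: int
    tokens: int
    completed_runs: int
    window_seconds: float


def throughput_report(counts: WindowCounts) -> dict[str, float]:
    """Express the same window in the three units mentioned above."""
    if counts.window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return {
        "requests_per_second": counts.requests / counts.window_seconds,
        "tokens_per_second": counts.tokens / counts.window_seconds,
        # Scale to an hour for slow, scheduled agent runs.
        "runs_per_hour": counts.completed_runs * 3600 / counts.window_seconds,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 10-minute window.
    sample = WindowCounts(requests=1200, tokens=3_600_000,
                          completed_runs=50, window_seconds=600)
    for unit, value in throughput_report(sample).items():
        print(f"{unit}: {value:.1f}")
```

Reporting all three from the same counters keeps dashboards for APIs, batch jobs, and scheduled agents comparable.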

Sources

AWS Bedrock: Throughput modes

Related Terms

Latency

How long an agent takes to respond. Time-to-first-token, total response time, an…

Inference

Running a trained model to generate output. The day-to-day cost in any productio…

Rate Limiting

Capping how many requests a user, IP, or app can make in a time window.…

Small Language Model (SLM)

A lightweight LLM (typically 1 to 8B parameters) optimized for low cost, low lat…
