Latency

How long an agent takes to respond. Time-to-first-token, total response time, and tool-call round-trip latency are the three sub-metrics that matter most.

Last updated: April 26, 2026

Definition

Latency is the perceived speed of your agent. Three sub-metrics matter.

First, time-to-first-token (TTFT): wall-clock time from sending the request to receiving the first output token. It determines how quickly the user sees the response start streaming.

Second, total response time: TTFT plus output generation time, which grows roughly linearly with output length.

Third, tool round-trip: for agents, the time to make one tool call and integrate the result into the loop.

Targets depend on the workload. For voice agents, sub-800ms TTFT is the conversational target. For text chat, 1-2 seconds is acceptable. For batch workloads, latency does not matter; throughput does.
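As a rough illustration of the first two sub-metrics, the sketch below times any iterable of streamed tokens. It assumes that iterating the stream is what triggers the request, so the clock starts just before the first token is pulled. The names measure_stream and fake_stream are hypothetical helpers made up for this example, not part of any particular SDK, and fake_stream only simulates a model with sleeps.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def measure_stream(tokens: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a token stream and record TTFT and total response time."""
    start = time.perf_counter()  # assumption: the request is sent when iteration begins
    ttft: Optional[float] = None
    count = 0
    for _ in tokens:
        if ttft is None:
            # First token arrived: this gap is the time-to-first-token.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start  # TTFT plus generation time
    return {"ttft_s": ttft, "total_s": total, "tokens": count}


def fake_stream(first_delay_s: float, per_token_s: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay_s)  # simulated queueing and prefill before the first token
    for i in range(n):
        if i > 0:
            time.sleep(per_token_s)  # simulated per-token generation time
        yield f"tok{i}"


result = measure_stream(fake_stream(first_delay_s=0.05, per_token_s=0.01, n=5))
print(result)
```

The same wrapper could be placed around a single tool call to get a round-trip number, provided the timer covers both the call and the step that feeds the result back to the model.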

When To Use

Track p50 and p95 latency separately. The average hides the tail. p95 over 5 seconds in chat will lose users even if p50 is 1 second.
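To make the point about the tail concrete, here is a minimal sketch that computes the mean, p50, and p95 of a small set of latency samples. The sample values are invented for illustration, and the nearest-rank method is one common percentile definition among several; a monitoring tool may interpolate differently.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Invented response times in seconds: mostly fast, with two slow outliers.
latencies_s = [0.8, 0.9, 1.0, 1.0, 1.1, 1.0, 0.9, 1.2, 6.5, 7.0]

mean = sum(latencies_s) / len(latencies_s)
p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)

print(f"mean={mean:.2f}s p50={p50:.2f}s p95={p95:.2f}s")
```

In this made-up sample the mean lands between the typical request and the slow ones, so it describes neither. The p50 shows what most users experience and the p95 exposes the slow requests that the average smooths over.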

Related Terms

Inference

Running a trained model to generate output. The day-to-day cost in any productio…

Throughput

How many requests or tokens an agent system can handle per unit of time. The met…

Time to First Audio (TTFA)

The total latency from when the user stops speaking to when the agent's first au…

Latency Budget

The allocation of acceptable delay across each stage of a voice agent pipeline s…
