Jahanzaib
Models & Training

Latency

How long an agent takes to respond. Time-to-first-token, total response time, and tool-call round-trip latency are the three sub-metrics that matter most.

Last updated: April 26, 2026

Definition

Latency is the perceived speed of your agent. Three sub-metrics matter. First, time-to-first-token (TTFT): wall-clock time from request sent to first output token received. Drives whether the user sees the response start streaming quickly. Second, total response time: TTFT plus output generation time, which scales linearly with output length. Third, tool round-trip: for agents, the loop time of making one tool call and integrating the result. For voice agents, sub-800ms TTFT is the conversational target. For text chat, 1-2 seconds is acceptable. For batch workloads, latency does not matter; throughput does.

When To Use

Track p50 and p95 latency separately. The average hides the tail. p95 over 5 seconds in chat will lose users even if p50 is 1 second.

Sources

Related Terms

Building with Latency?

I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.