Latency
How long an agent takes to respond. Time-to-first-token, total response time, and tool-call round-trip latency are the three sub-metrics that matter most.
Last updated: April 26, 2026
Definition
Latency is the perceived speed of your agent. Three sub-metrics matter. First, time-to-first-token (TTFT): wall-clock time from request sent to first output token received. Drives whether the user sees the response start streaming quickly. Second, total response time: TTFT plus output generation time, which scales linearly with output length. Third, tool round-trip: for agents, the loop time of making one tool call and integrating the result. For voice agents, sub-800ms TTFT is the conversational target. For text chat, 1-2 seconds is acceptable. For batch workloads, latency does not matter; throughput does.
When To Use
Track p50 and p95 latency separately. The average hides the tail. p95 over 5 seconds in chat will lose users even if p50 is 1 second.
Related Terms
Building with Latency?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.