Time to First Audio (TTFA)
The total latency from when the user stops speaking to when the agent's first audio chunk plays back. The single most important voice agent metric.
Last updated: April 26, 2026
Definition
Time to First Audio is the wall-clock latency from the end of the user's utterance to the first audio the agent plays back. It is the sum of four stages: end-of-turn detection delay (typically 300 to 500ms after the user stops speaking), STT finalization of the transcript (~150ms), LLM time-to-first-token (~200 to 400ms with streaming), and TTS time-to-first-chunk (~150 to 250ms). Adding the low and high ends of those ranges gives a total of 800ms to 1300ms in production. Below 800ms feels conversational. Above 1500ms feels broken. TTFA is what users perceive as "the agent's response time," and it is the single biggest determinant of whether a voice agent feels alive or dead.
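The stage arithmetic above can be written out as a small budget check. This is a minimal sketch: the stage names, the function names, and the "in_between" label for the 800 to 1500ms band are illustrative assumptions, and only the millisecond ranges and the two thresholds come from the text.

```python
# Per-stage latency ranges in milliseconds as (low, high), taken from the
# figures quoted in the definition. The dictionary keys are illustrative names.
STAGE_BUDGET_MS = {
    "end_of_turn_detection": (300, 500),
    "stt_final_transcript": (150, 150),
    "llm_first_token": (200, 400),
    "tts_first_chunk": (150, 250),
}


def ttfa_range_ms(budget):
    """Sum the low ends and the high ends of every stage range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def classify_ttfa(ttfa_ms):
    """Map a measured TTFA to the perception bands described in the text.

    The text only names the two extremes; "in_between" is an assumed label
    for everything from 800ms to 1500ms.
    """
    if ttfa_ms < 800:
        return "conversational"
    if ttfa_ms > 1500:
        return "broken"
    return "in_between"


if __name__ == "__main__":
    print(ttfa_range_ms(STAGE_BUDGET_MS))
    print(classify_ttfa(950))
```

Keeping the budget per stage rather than as a single total makes it easier to see which stage has drifted when the overall number moves.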
Two optimizations dominate the TTFA fight. First, streaming everywhere: STT streams partials to the LLM, the LLM streams tokens to the TTS, and the TTS streams audio chunks to the caller. With streaming, each stage contributes only its time to first output. Without it, every stage has to finish before the next one starts, so TTFA becomes the sum of full-stage latencies, which is nearly impossible to keep under 2 seconds. Second, model selection: a Haiku-class model at 200ms time-to-first-token is dramatically better for voice than an Opus-class model at 800ms, even if the Opus-class model is smarter. The right pattern is a small, fast LLM for voice, escalating to a larger model only when needed.
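To make the streaming argument concrete, the sketch below compares the two ways of summing stage latencies. It is a simplified model, not a measurement: the `Stage` type, the function names, and especially the `full_ms` values are hypothetical numbers chosen for illustration, while the `first_output_ms` values sit inside the ranges quoted in the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first usable output
    full_ms: int          # time until the stage has produced ALL of its output


# Hypothetical pipeline. The full_ms figures for the LLM and TTS are assumed
# values for a multi-sentence reply, not numbers from the text.
PIPELINE = [
    Stage("end_of_turn_detection", first_output_ms=400, full_ms=400),
    Stage("stt", first_output_ms=150, full_ms=150),
    Stage("llm", first_output_ms=300, full_ms=1500),
    Stage("tts", first_output_ms=200, full_ms=1200),
]


def ttfa_streaming_ms(stages):
    """Each stage hands off as soon as it has its first output."""
    return sum(s.first_output_ms for s in stages)


def ttfa_batch_ms(stages):
    """Each stage must finish completely before the next one starts."""
    return sum(s.full_ms for s in stages)


if __name__ == "__main__":
    print("streaming:", ttfa_streaming_ms(PIPELINE), "ms")
    print("batch:", ttfa_batch_ms(PIPELINE), "ms")
```

With these assumed numbers the streaming path lands at 1050ms and the batch path at 3250ms. The model ignores overlap between stages, so a real streaming pipeline may do somewhat better than the simple sum suggests.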
When To Use
Track TTFA on every voice call. Set an alert when the p95 exceeds your target (typically 1000ms for B2C, 1500ms for B2B). Optimize the slowest stage first.
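One way to turn that guidance into a check is sketched below. It assumes per-call stage timings are already being collected as dictionaries of stage name to milliseconds; the function names, the nearest-rank percentile method, and the stage keys are illustrative choices, not part of any specific monitoring tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def should_alert(ttfa_samples_ms, target_ms=1000):
    """True when the p95 TTFA exceeds the target (1000ms is the B2C figure)."""
    return percentile(ttfa_samples_ms, 95) > target_ms


def slowest_stage(calls):
    """Return the stage with the highest p95 latency across calls.

    `calls` is a list of dicts mapping stage name to milliseconds, and every
    dict is assumed to contain the same stage keys.
    """
    stages = list(calls[0].keys())
    return max(stages, key=lambda s: percentile([c[s] for c in calls], 95))


if __name__ == "__main__":
    calls = [
        {"end_of_turn": 400, "stt": 150, "llm": 300, "tts": 200},
        {"end_of_turn": 450, "stt": 150, "llm": 900, "tts": 220},
    ]
    totals = [sum(c.values()) for c in calls]
    print(should_alert(totals), slowest_stage(calls))
```

Ranking stages by p95 rather than by mean keeps the focus on the tail, which is the same statistic the alert is based on.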