Latency Budget
The allocation of acceptable delay across the stages of a voice agent pipeline so that total Time to First Audio (TTFA) meets the target.
Last updated: April 26, 2026
Definition
A latency budget is the per-stage allocation that keeps your total Time to First Audio (TTFA) under target. For a 1000ms target you might allocate 400ms to turn detection (the largest single component, and hard to shrink), 150ms to the STT final transcript, 300ms to LLM time-to-first-token, and 150ms to the TTS first chunk. That sums to exactly 1000ms, with no buffer. The discipline is to instrument every stage in production and watch which one breaks its budget first. The stage that overruns its allocation by the most is your optimization target. Without a budget, every stage feels equally important and nothing gets prioritized.
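As a rough illustration, the example allocation above can be written down as data and checked against measurements. This is a minimal sketch: the stage names, the `measured` values, and the helper functions are hypothetical and only show the bookkeeping, not a real pipeline.

```python
# Per-stage allocations from the 1000ms example (stage names are illustrative).
BUDGET_MS = {
    "turn_detection": 400,
    "stt_final": 150,
    "llm_ttft": 300,
    "tts_first_chunk": 150,
}
TARGET_TTFA_MS = 1000


def headroom_ms(budget, target_ms):
    """Unallocated time in ms; negative means the budget exceeds the target."""
    return target_ms - sum(budget.values())


def worst_overrun(budget, measured_ms):
    """Return (stage, overrun_ms) for the stage furthest over budget, or None."""
    overruns = {s: measured_ms[s] - budget[s] for s in budget if s in measured_ms}
    if not overruns:
        return None
    stage = max(overruns, key=overruns.get)
    return (stage, overruns[stage]) if overruns[stage] > 0 else None


# Hypothetical measurements for one turn, in milliseconds.
measured = {
    "turn_detection": 410,
    "stt_final": 140,
    "llm_ttft": 420,
    "tts_first_chunk": 150,
}

print(headroom_ms(BUDGET_MS, TARGET_TTFA_MS))
print(worst_overrun(BUDGET_MS, measured))
```

With these made-up numbers the LLM stage is the one to optimize, even though turn detection is the larger stage in absolute terms.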
Two tactics consistently move latency budgets. First, keep the LLM small. Switching from Sonnet to Haiku (or GPT-5 to GPT-5 mini) typically halves the time-to-first-token at minimal quality cost for voice tasks, where responses are short and structured. Second, shrink the silence threshold for end-of-turn detection only as far as your false-trigger rate allows. Cutting it from 800ms to 400ms saves 400ms on every turn, but makes the agent more likely to interrupt mid-sentence when the user pauses to think. The right value is workload-specific: shorter for transactional calls, longer for emotional support calls.
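One way to make the silence-threshold trade-off concrete is to estimate the false-trigger rate from logged mid-utterance pauses and pick the shortest threshold that stays under an acceptable rate. The sketch below assumes you already have such pause durations; the sample data, the candidate thresholds, and the function names are hypothetical.

```python
def false_trigger_rate(mid_utterance_pauses_ms, threshold_ms):
    """Fraction of mid-utterance pauses long enough to be mistaken for end of turn."""
    if not mid_utterance_pauses_ms:
        return 0.0
    hits = sum(1 for p in mid_utterance_pauses_ms if p >= threshold_ms)
    return hits / len(mid_utterance_pauses_ms)


def pick_threshold(pauses_ms, candidates_ms, max_rate):
    """Shortest candidate whose false-trigger rate is at most max_rate.

    Falls back to the longest candidate if none qualifies.
    """
    for threshold in sorted(candidates_ms):
        if false_trigger_rate(pauses_ms, threshold) <= max_rate:
            return threshold
    return max(candidates_ms)


# Hypothetical pauses (ms) observed inside user utterances, not at their end.
pauses = [120, 200, 250, 300, 350, 450, 500, 600, 700, 900]

for candidate in (400, 600, 800):
    print(candidate, false_trigger_rate(pauses, candidate))

print(pick_threshold(pauses, [400, 600, 800], max_rate=0.1))
```

A transactional workload might tolerate a higher `max_rate` and therefore a shorter threshold; an emotional support line would likely set it lower.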
When To Use
Define the budget before you ship. After launch, instrument every stage and review the numbers weekly. Most voice agent latency problems are budget violations in one stage that creep in over time.
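A weekly review can be as simple as comparing a high percentile of each stage's logged latency to its allocation. The following is a sketch under assumptions: the choice of p95, the nearest-rank percentile method, and the sample data are illustrative, not a prescribed method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def budget_violations(budget_ms, samples_by_stage, pct=95):
    """Map each stage whose pct-th percentile exceeds its allocation to the overrun in ms."""
    report = {}
    for stage, limit in budget_ms.items():
        samples = samples_by_stage.get(stage)
        if not samples:
            continue  # no data logged for this stage this week
        observed = percentile(samples, pct)
        if observed > limit:
            report[stage] = observed - limit
    return report


# Hypothetical week of per-turn latencies in milliseconds.
budget = {"stt_final": 150, "llm_ttft": 300}
week = {
    "stt_final": [100, 120, 130, 140],
    "llm_ttft": [250, 280, 310, 400],
}

print(budget_violations(budget, week))
```

Tracking the same report week over week makes slow creep in a single stage visible before it breaks the total.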