Jahanzaib
Voice & Audio

Turn Detection

How a voice agent decides when the caller has stopped speaking and it is the agent's turn to respond.

Last updated: April 26, 2026

Definition

Turn detection (also called end-of-turn or end-of-utterance detection) is the deceptively hard problem of knowing when the user has finished speaking and the agent should start responding. The simplest approach is voice activity detection (VAD) plus a silence threshold: when the audio level stays below a threshold for N milliseconds, declare the turn over. This works for clean transactional speech but fails when users pause mid-sentence to think. Modern systems use semantic turn detection: a small classifier model that looks at the partial transcript and predicts whether the utterance is complete. LiveKit, Pipecat, and Vapi all ship semantic turn detectors that significantly outperform pure VAD on speech with mid-sentence pauses.
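As a minimal sketch of the VAD-plus-silence-threshold approach, the function below treats a frame as speech when its RMS level crosses a threshold and declares the turn over after enough consecutive quiet time. The names (`end_of_turn_ms`, `level_threshold`, `silence_ms`) and the default values are illustrative assumptions, not taken from any particular framework, and a production VAD would normally be a trained model rather than a raw energy check.

```python
import math
from typing import Iterable, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_ms(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 20,             # assumed frame duration
    level_threshold: float = 0.02,  # assumed energy threshold for "speech"
    silence_ms: int = 700,          # the "N milliseconds" of silence
) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None."""
    heard_speech = False
    quiet_ms = 0
    elapsed_ms = 0
    for frame in frames:
        elapsed_ms += frame_ms
        if rms(frame) >= level_threshold:
            # Any speech frame resets the silence timer.
            heard_speech = True
            quiet_ms = 0
        elif heard_speech:
            # Only count silence that follows speech, not leading silence.
            quiet_ms += frame_ms
            if quiet_ms >= silence_ms:
                return elapsed_ms
    return None


# A caller who speaks, pauses 400 ms to think, then continues.
speech = [0.5] * 160
silence = [0.0] * 160
paused = [speech] * 10 + [silence] * 20 + [speech] * 10 + [silence] * 50

print(end_of_turn_ms(paused, silence_ms=700))  # 1500: waits through the pause
print(end_of_turn_ms(paused, silence_ms=300))  # 500: cuts in during the pause
```

With the shorter threshold the detector fires inside the thinking pause, which is exactly the failure a semantic detector is meant to avoid.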

Bad turn detection has two failure modes. First, false-cut: the agent starts talking while the user is still mid-thought, which is rude and frustrating. Second, slow-cut: the agent waits a beat too long after the user finishes, which makes the conversation feel sluggish. The two failure modes pull in opposite directions: shortening the silence threshold reduces slow-cuts but produces more false-cuts, and lengthening it does the reverse. Tuning is therefore workload-specific. Customer support calls tolerate slower turn detection (long pauses are natural). Order-taking and appointment booking need fast turn detection (users speak in short bursts). Always log false-cut events; they are the single biggest cause of voice agent abandonment.
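One way to make the two failure modes measurable is to log, for every turn, when the user actually finished and when the agent started speaking, then classify the gap. The sketch below assumes those two timestamps are available (for example from hand annotation, or inferred when the user resumes speaking over the agent). The `Turn` record, the function names, and the 1200 ms slow-cut budget are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    user_end_ms: int     # when the user actually finished speaking (assumed known)
    agent_start_ms: int  # when the agent began its response


def classify(turn: Turn, slow_budget_ms: int = 1200) -> str:
    """Label one turn as 'false_cut', 'slow_cut', or 'ok'."""
    gap_ms = turn.agent_start_ms - turn.user_end_ms
    if gap_ms < 0:
        return "false_cut"  # agent started before the user was done
    if gap_ms > slow_budget_ms:
        return "slow_cut"   # agent waited longer than the budget
    return "ok"


def summarize(turns: List[Turn], slow_budget_ms: int = 1200) -> Dict[str, float]:
    """Return the fraction of turns falling into each category."""
    counts = {"false_cut": 0, "slow_cut": 0, "ok": 0}
    for turn in turns:
        counts[classify(turn, slow_budget_ms)] += 1
    total = len(turns)
    return {label: (n / total if total else 0.0) for label, n in counts.items()}


turns = [Turn(1000, 800), Turn(1000, 1300), Turn(1000, 2500), Turn(1000, 1500)]
print(summarize(turns))  # {'false_cut': 0.25, 'slow_cut': 0.25, 'ok': 0.5}
```

Tracking both rates side by side shows the trade-off directly: a threshold change that lowers one rate will usually raise the other.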

When To Use

Every voice agent needs turn detection. Use semantic turn detection for B2C and a faster, threshold-based detector for transactional B2B.
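The choice between the two styles can be expressed as configuration. The sketch below is one possible shape, not any vendor's API: a profile either waits a fixed silence, or asks a completeness scorer and waits longer when the partial transcript looks unfinished. `TurnConfig`, `required_silence_ms`, the millisecond values, and `toy_completeness` (a word-list stand-in for a trained classifier) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TurnConfig:
    silence_ms: int      # wait when the utterance looks complete
    max_silence_ms: int  # wait when the utterance looks unfinished
    completeness: Optional[Callable[[str], float]] = None  # None = threshold only


def toy_completeness(text: str) -> float:
    """Stand-in for a trained classifier: a trailing filler or connective
    word is treated as a sign that the speaker is not finished."""
    words = text.lower().rstrip(".,?! ").split()
    if not words:
        return 0.0
    unfinished = {"and", "but", "so", "or", "because", "um", "uh", "the", "to", "my"}
    return 0.1 if words[-1] in unfinished else 0.9


def required_silence_ms(cfg: TurnConfig, partial_transcript: str) -> int:
    """How long to wait in silence before declaring the turn over."""
    if cfg.completeness is None:
        return cfg.silence_ms
    score = cfg.completeness(partial_transcript)
    return cfg.silence_ms if score >= 0.5 else cfg.max_silence_ms


# Illustrative profiles; the numbers are placeholders to tune per workload.
TRANSACTIONAL = TurnConfig(silence_ms=400, max_silence_ms=400)
CONVERSATIONAL = TurnConfig(silence_ms=500, max_silence_ms=2000,
                            completeness=toy_completeness)

print(required_silence_ms(TRANSACTIONAL, "I want to"))                 # 400
print(required_silence_ms(CONVERSATIONAL, "I want to"))                # 2000
print(required_silence_ms(CONVERSATIONAL, "I want to book a table."))  # 500
```

Keeping the scorer as a pluggable callable means the threshold-only profile and the semantic profile share one code path, so switching a workload between them is a configuration change.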
