STT → LLM → TTS Pipeline
The three-stage architecture of every modern voice agent: speech to text, then language model, then text to speech.
Last updated: April 26, 2026
Definition
Almost every production voice agent in 2026 uses the same three-stage pipeline. Speech-to-text (STT) transcribes the caller's audio in real time. The text goes to an LLM that decides what to say back. Text-to-speech (TTS) renders the LLM's reply as audio and streams it to the speaker. This pipeline is what platforms like Vapi, Retell, Bland, LiveKit Agents, and Pipecat all wrap. Each stage adds latency, and the sum determines whether the conversation feels natural. Streaming at every stage (token-by-token from the LLM, audio-chunk-by-chunk from TTS) is what cuts perceived latency below the natural-conversation threshold of about 800ms.
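The streaming point above can be sketched in a few lines. This is an illustrative TypeScript example with stand-in helpers (`llmTokens` and the `synthesize` callback are hypothetical, not a real SDK): instead of waiting for the full LLM reply, each completed sentence is flushed to TTS immediately, so the first audio plays while later tokens are still streaming in.

```typescript
// Sketch: sentence-level streaming from LLM to TTS (hypothetical helpers).
async function* llmTokens(): AsyncGenerator<string> {
  // Stand-in for a streaming LLM response arriving token by token.
  for (const t of ['Your ', 'order ', 'shipped ', 'today. ', 'Anything ', 'else?']) {
    yield t;
  }
}

async function speakStreaming(synthesize: (text: string) => Promise<void>): Promise<void> {
  let buffer = '';
  for await (const token of llmTokens()) {
    buffer += token;
    // Flush on sentence boundaries so TTS can start before the LLM finishes.
    const match = buffer.match(/^(.*?[.!?])\s*(.*)$/s);
    if (match) {
      await synthesize(match[1]); // first audio plays here, mid-generation
      buffer = match[2];
    }
  }
  if (buffer.trim()) await synthesize(buffer); // flush any trailing fragment
}

// Usage: speakStreaming(text => tts.speak(text))
```

The same idea applies at the STT boundary (interim transcripts) and inside TTS (audio chunks): every stage hands work downstream before its own input is complete.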
Newer architectures collapse this pipeline. OpenAI's Realtime API and Google's Gemini Live use a single speech-to-speech model that takes audio in and produces audio out, skipping the explicit text intermediate. The tradeoff: lower latency and more natural prosody, but harder to debug, harder to cite sources from RAG, and harder to enforce structured tool-calling. Production systems in regulated industries still favor the three-stage pipeline because every stage produces text logs that satisfy compliance. The end-to-end speech models will dominate in lower-stakes consumer use cases first.
Architecture
The canonical voice agent pipeline. Each box adds latency; total Time to First Audio determines whether the conversation feels natural. Sub-800ms is the production target.
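A back-of-envelope Time to First Audio budget makes the target concrete. The per-stage numbers below are illustrative assumptions, not benchmarks; the point is that the stages sum, so every one must stay small for the total to clear 800ms.

```typescript
// Illustrative TTFA budget (assumed numbers, not measured benchmarks).
const budgetMs = {
  endpointing: 200,   // VAD confirming the caller has stopped talking
  sttFinal: 100,      // streaming STT emitting the final transcript
  llmFirstToken: 250, // LLM time to first token
  ttsFirstByte: 150,  // TTS time to first audio chunk
  network: 50,        // transport overhead between stages
};

const timeToFirstAudio = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`TTFA: ${timeToFirstAudio}ms (target: <800ms)`); // 750ms
```

Shaving any single stage helps, but endpointing and LLM first-token latency are typically the largest and most variable line items.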
Code Example
// LiveKit Agents pipeline (streaming end-to-end)
import { defineAgent, AgentSession } from '@livekit/agents';
import { deepgram } from '@livekit/agents-plugin-deepgram';
import { anthropic } from '@livekit/agents-plugin-anthropic';
import { elevenlabs } from '@livekit/agents-plugin-elevenlabs';

export default defineAgent(async (ctx) => {
  const session = new AgentSession({
    stt: deepgram.STT({ model: 'nova-3' }),            // streaming transcription
    llm: anthropic.LLM({ model: 'claude-haiku-4-5' }), // token-by-token replies
    tts: elevenlabs.TTS({ voiceId: '...' }),           // chunked audio synthesis
  });
  await session.start({ room: ctx.room });
});
A real LiveKit Agents config wiring all three stages with streaming. About a dozen lines for a working voice agent.
When To Use
Default architecture for any production voice agent in 2026. Switch to a unified speech-to-speech model (OpenAI Realtime, Gemini Live) only when latency matters more than auditability and structured tool use.
Common Questions
Why not just use a single speech-to-speech model?
Speech-to-speech models (OpenAI Realtime, Gemini Live) are excellent for low-latency consumer voice but harder to log, harder to use with RAG, and weaker at structured tool calling. The three-stage pipeline is still the right choice for regulated industries (healthcare, finance, legal) where every utterance must be auditable.
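The auditability argument comes down to text intermediates: each stage of the three-stage pipeline emits text you can log verbatim. A minimal sketch of what that log might look like, with hypothetical hook points (real SDKs such as LiveKit Agents and Pipecat expose similar transcript and reply events, but the names here are invented):

```typescript
// Sketch: the text intermediates that make the pipeline auditable (hypothetical hooks).
type TurnLog = {
  callId: string;
  role: 'caller' | 'agent';
  text: string;      // exact STT transcript, or exact text sent to TTS
  timestamp: string;
};

const auditLog: TurnLog[] = [];

function logTurn(callId: string, role: TurnLog['role'], text: string): void {
  auditLog.push({ callId, role, text, timestamp: new Date().toISOString() });
}

// On STT final transcript: log the caller's words verbatim.
logTurn('call-123', 'caller', 'I need to dispute a charge.');
// On LLM reply: log exactly the text handed to TTS.
logTurn('call-123', 'agent', 'I can help with that. Can you confirm the amount?');
```

A speech-to-speech model has no such text checkpoint: you only get audio out, so producing an equivalent record means transcribing the model's own output after the fact.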
Building with STT → LLM → TTS Pipeline?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.