Text-to-Speech (TTS)
Neural model that synthesizes natural-sounding speech audio from the LLM's text output, streamed back to the caller.
Last updated: April 26, 2026
Definition
Text-to-speech (also called speech synthesis) is the final stage of the voice agent pipeline. The LLM emits text, the TTS model renders it as audio, and that audio streams to the caller. Modern TTS uses neural vocoders that produce voices indistinguishable from human speech in most short utterances. Top providers in 2026: ElevenLabs (best naturalness, 30+ languages), Cartesia Sonic (lowest latency, ~90ms first chunk), PlayHT, Rime, and OpenAI TTS. The right choice depends on your latency budget, the languages you need, and whether you require voice cloning.
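Because the LLM streams text and the TTS model streams audio, a common pattern is to buffer LLM tokens into sentence-sized chunks and start synthesis as soon as each sentence completes, rather than waiting for the full reply. Here is a minimal sketch of that chunking step; `synthesize` is a hypothetical stand-in for a real provider's streaming TTS call, not any specific API.

```python
import re
from typing import Iterable, Iterator

# Matches a sentence-ending punctuation mark (optionally followed by a
# closing quote/bracket) at the end of the buffer. A real implementation
# would also guard against abbreviations and decimal numbers.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield text chunks at sentence boundaries as LLM tokens arrive."""
    buf = ""
    for tok in tokens:
        buf += tok
        if SENTENCE_END.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()

def synthesize(text: str) -> bytes:
    """Placeholder for a provider's TTS call; returns fake audio bytes."""
    return f"<audio:{text}>".encode()

# Usage: stream each completed sentence to TTS immediately.
tokens = ["Hello", ", ", "caller", ". ", "How ", "can ", "I ", "help", "?"]
audio_chunks = [synthesize(s) for s in sentence_chunks(tokens)]
```

Chunking at sentence boundaries keeps Time to First Audio low while still giving the TTS engine enough context to produce natural prosody within each utterance.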
Latency from text input to first audio chunk is the metric that matters. Cartesia and ElevenLabs Flash both deliver under 200ms. Older or non-streaming TTS APIs that wait for the full text before synthesis push your total Time to First Audio over a second, which makes the conversation feel laggy.

SSML tags let you control pronunciation, pacing, and emphasis, but each TTS engine supports a different SSML subset, so always test SSML on your specific provider before relying on it. For multi-language voice agents, the same voice ID rarely sounds equally good in all languages; expect to use a different voice per language.
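As a concrete illustration, here is a small SSML fragment using tags from the core SSML specification (`<break>`, `<emphasis>`, `<prosody>`). Whether each tag is honored depends on the engine, so treat this as a sketch to test against your own provider. Since malformed SSML is often read aloud literally or dropped silently, a cheap well-formedness check before sending can catch mistakes early:

```python
import xml.etree.ElementTree as ET

# Example SSML using widely supported tags. Support varies by engine:
# verify each tag against your provider's docs and test audibly.
ssml = (
    '<speak>'
    'Your balance is <emphasis level="strong">forty dollars</emphasis>.'
    '<break time="300ms"/>'
    '<prosody rate="slow">Please say yes or no.</prosody>'
    '</speak>'
)

# Sanity check: raises xml.etree.ElementTree.ParseError if the markup
# is not well-formed XML, before you ever send it to the TTS API.
ET.fromstring(ssml)
```

This only validates XML structure, not whether the engine understands each tag; that still requires listening to the output on your chosen provider.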
When To Use
Required for any voice agent built on the three-stage pipeline. Pick the provider by latency, naturalness, and language coverage, in that order.
Building with Text-to-Speech (TTS)?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.