Speech-to-Text (STT)
The first stage of a voice agent pipeline that transcribes user audio into text in real time using deep neural networks.
Last updated: April 26, 2026
Definition
Speech-to-text (also called automatic speech recognition or ASR) converts incoming audio into text the LLM can consume. Production STT for voice agents needs three things: real-time streaming (interim transcripts arrive word-by-word as the user speaks, not after the full utterance), accurate end-of-utterance detection, and low word error rate across accents and noisy phone-line audio. Top providers in 2026 are Deepgram (Nova-3), AssemblyAI, OpenAI Whisper (when self-hosted via faster-whisper), and Speechmatics. Phone-call audio is typically 8kHz mono PCM, which all major providers handle natively.
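To stream that 8kHz phone audio, the agent typically slices raw PCM into small fixed-size frames and sends each one to the provider's websocket as it arrives. Here is a minimal stdlib sketch of the framing step; the 20ms frame size and 16-bit linear PCM are assumptions for illustration (real SDKs and telephony stacks vary, and some use 8-bit mulaw):

```python
SAMPLE_RATE = 8000    # phone-line audio: 8 kHz mono
FRAME_MS = 20         # assumed chunk size; providers accept a range
BYTES_PER_SAMPLE = 2  # assumed 16-bit linear PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 320 bytes

def frames(pcm: bytes):
    """Yield fixed-size frames suitable for a streaming STT connection."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]

# One second of 16-bit silence -> fifty 20 ms frames
silence = b"\x00\x00" * SAMPLE_RATE
chunks = list(frames(silence))
print(len(chunks), len(chunks[0]))  # 50 320
```

In production each frame would be written to the provider's websocket immediately rather than collected in a list, which is what keeps latency to the first interim transcript low.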
Two metrics matter most. First, latency to first interim transcript: production targets under 200ms. Second, word error rate (WER) on your specific user population: a Scottish accent or strong Indian English on a noisy phone line will perform very differently from clean studio audio in a benchmark. Always test STT on real recordings from your actual user base before committing. Keyterm boosting (telling the STT to pay extra attention to your product names, customer-specific vocabulary, or rare proper nouns) typically cuts the WER on critical terms by 30 to 50 percent.
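Measuring WER on your own recordings is straightforward: it is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal stdlib sketch is below; in practice a library such as jiwer adds the text normalization (punctuation, casing, number formats) that fair comparisons need:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic programming over a single rolling row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (r != h)) # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("turn on the kitchen lights", "turn on the chicken lights"))  # 0.2
```

Running this over a few hundred real calls per provider, and separately over utterances containing your critical product terms, gives a far better basis for choosing a vendor than published benchmarks.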
When To Use
Required for any voice agent built on the standard three-stage pipeline. Skip only if using a unified speech-to-speech model.
Building with Speech-to-Text (STT)?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.