Speech-to-Text (STT)
The first stage of a voice agent pipeline that transcribes user audio into text in real time using deep neural networks.
Last updated: April 26, 2026
Definition
Speech-to-text (also called automatic speech recognition or ASR) converts incoming audio into text the LLM can consume. Production STT for voice agents needs three things: real-time streaming (interim transcripts arrive word-by-word as the user speaks, not after the full utterance), accurate end-of-utterance detection, and low word error rate across accents and noisy phone-line audio. Top providers in 2026 are Deepgram (Nova-3), AssemblyAI, OpenAI Whisper (when self-hosted via faster-whisper), and Speechmatics. Phone-call audio is typically 8kHz mono PCM, which all major providers handle natively.
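To stream that 8kHz phone audio, the agent typically slices raw PCM into small fixed-size frames and sends each one to the provider's websocket as it arrives. Here is a minimal stdlib sketch of the framing step; the 20ms frame size and 16-bit linear PCM are assumptions for illustration (real SDKs and telephony stacks vary, and some use 8-bit mulaw):

```python
SAMPLE_RATE = 8000    # phone-line audio: 8 kHz mono
FRAME_MS = 20         # assumed chunk size; providers accept a range
BYTES_PER_SAMPLE = 2  # assumed 16-bit linear PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 320 bytes

def frames(pcm: bytes):
    """Yield fixed-size frames suitable for a streaming STT connection."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[i:i + FRAME_BYTES]

# One second of 16-bit silence -> fifty 20 ms frames
silence = b"\x00\x00" * SAMPLE_RATE
chunks = list(frames(silence))
print(len(chunks), len(chunks[0]))  # 50 320
```

In production each frame would be written to the provider's websocket immediately rather than collected in a list, which is what keeps latency to the first interim transcript low.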
Two metrics matter most. First, latency to first interim transcript: production targets under 200ms. Second, word error rate (WER) on your specific user population: a Scottish accent or strong Indian English on a noisy phone line will perform very differently from clean studio audio in a benchmark. Always test STT on real recordings from your actual user base before committing. Keyterm boosting (telling the STT to pay extra attention to your product names, customer-specific vocabulary, or rare proper nouns) typically cuts the WER on critical terms by 30 to 50 percent.
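Measuring WER on your own recordings is straightforward: it is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal stdlib sketch is below; in practice a library such as jiwer adds the text normalization (punctuation, casing, number formats) that fair comparisons need:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic programming over a single rolling row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (r != h)) # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("turn on the kitchen lights", "turn on the chicken lights"))  # 0.2
```

Running this over a few hundred real calls per provider, and separately over utterances containing your critical product terms, gives a far better basis for choosing a vendor than published benchmarks.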
When To Use
Required for any voice agent built on the standard three-stage pipeline. Skip only if using a unified speech-to-speech model.
Building with Speech-to-Text (STT)?
I've shipped this pattern in real production systems. If you want a second pair of eyes on your architecture, that's what I do.