Neural Vocoder

The deep learning model inside a TTS system that converts predicted acoustic features into actual audio waveforms.

Last updated: April 26, 2026

Definition

Most modern text-to-speech systems work in two stages internally. First, an acoustic model predicts a mel-spectrogram (a time-frequency representation of the speech) from the input text. Second, a neural vocoder converts that spectrogram into waveform samples you can play. Neural vocoders such as WaveGlow and HiFi-GAN, and the more recent BigVGAN and Vocos, are a large part of what gives 2026-era TTS its near-human naturalness. They replaced older signal-processing approaches (Griffin-Lim, WORLD), which produced obviously synthetic-sounding audio. End users do not interact with the vocoder directly; it lives inside the TTS provider's stack.
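To make the second stage concrete, here is a minimal sketch of the spectrogram-to-waveform step using Griffin-Lim, the older signal-processing baseline, not a neural vocoder. It rests on several assumptions: it uses only NumPy, it works on a linear magnitude spectrogram rather than a mel-spectrogram, and the names (stft, istft, griffin_lim) and the settings (SAMPLE_RATE, N_FFT, HOP) are illustrative and not taken from any TTS library. A synthetic 440 Hz tone stands in for the acoustic model's output.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate for the demo
N_FFT = 512           # assumed analysis window length, in samples
HOP = 128             # assumed hop between frames, in samples


def stft(x, n_fft=N_FFT, hop=HOP):
    """Short-time Fourier transform; returns shape (n_frames, n_fft // 2 + 1)."""
    # Reflect-pad so the edges of the signal are covered by full windows.
    x = np.pad(x, n_fft // 2, mode="reflect")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=N_FFT, hop=HOP):
    """Inverse STFT by windowed overlap-add, trimming the padding added by stft."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    length = n_fft + hop * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)
    pad = n_fft // 2
    return out[pad:-pad]


def griffin_lim(magnitude, n_iter=32, seed=0):
    """Estimate a waveform from magnitudes alone by iteratively refining the phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random starting phase
    for _ in range(n_iter):
        audio = istft(magnitude * phase)
        rebuilt = stft(audio)
        # Keep the known magnitudes, adopt the phase of the re-analysed signal.
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * phase)


# Stand-in for the acoustic model's output: the magnitude spectrogram of a test tone.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)
magnitude = np.abs(stft(tone))

# The "vocoder" step: spectrogram in, playable waveform samples out.
audio = griffin_lim(magnitude)
```

A neural vocoder such as HiFi-GAN fills the same slot (spectrogram in, waveform out) but replaces the iterative phase estimate with a learned network.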

When To Use

You will not configure the vocoder directly when using a hosted TTS API. Knowing the term helps when reading TTS papers, evaluating self-hosted TTS options, or debugging audio quality problems.

Related Terms

Text-to-Speech (TTS)

Neural model that synthesizes natural-sounding speech audio from the LLM's text …

Voice Cloning

Using a few seconds of reference audio to synthesize new speech in that specific…

STT → LLM → TTS Pipeline

The three-stage architecture of every modern voice agent: speech to text, then l…
