SSML (Speech Synthesis Markup Language)
XML-based markup that controls how a TTS engine pronounces text, including pacing, emphasis, pronunciation, and pauses.
Last updated: April 26, 2026
Definition
SSML is a W3C-standardized XML markup language for fine-grained control over text-to-speech output. You wrap your text in tags like <prosody rate="slow"> to slow down speech, <break time="500ms"/> to insert pauses, <say-as interpret-as="telephone"> to make a phone number read digit-by-digit, or <phoneme alphabet="ipa" ph="...">…</phoneme> to override pronunciation. Originally developed in the early 2000s and now at version 1.1, SSML is supported most fully by engines such as Amazon Polly and Google Cloud TTS. Newer providers such as ElevenLabs and Cartesia accept only a small, engine-specific subset of tags, and some APIs (OpenAI's text-to-speech endpoint, as far as I know) do not document SSML support at all. Always check which subset of SSML your specific TTS provider supports before relying on a tag.
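One way to act on the "check the subset" advice is to compare the tags in a document against an allow-list before sending it. The sketch below is only an illustration: SUPPORTED_TAGS is a made-up allow-list and unsupported_tags is a hypothetical helper, so fill the set in from your provider's documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: replace with the tags your provider documents.
SUPPORTED_TAGS = {"speak", "break", "say-as", "sub"}

def unsupported_tags(ssml, supported=SUPPORTED_TAGS):
    """Return the element names used in `ssml` that are not in `supported`."""
    root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    # Drop any XML namespace prefix of the form "{uri}tag".
    names = {el.tag.rsplit("}", 1)[-1] for el in root.iter()}
    return names - supported

doc = '<speak>Hello <prosody rate="slow">there</prosody><break time="300ms"/></speak>'
print(unsupported_tags(doc))
```

Running a check like this in tests or at request time catches tags an engine might otherwise ignore silently or read aloud as literal text.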
For modern neural TTS, SSML matters less than it used to. Neural models generally produce natural prosody and handle common pronunciations without explicit hints. The SSML tags that still earn their keep are: <break> for inserting deliberate pauses (especially important for spelling, phone numbers, and addresses), <say-as> for forcing a specific interpretation of digits, dates, and currency, and <sub> for substituting the pronunciation of brand names or jargon. Emotion and speaking rate, by contrast, are increasingly controlled through voice settings or natural-language directives in the prompt rather than through SSML.
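If you generate these three tags from application code, the main pitfall is unescaped text breaking the XML. Here is a minimal sketch using only Python's standard library. The helper names (pause, say_as, sub, speak) are illustrative and not part of any TTS SDK, and the interpret-as values an engine accepts are provider-specific.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def pause(ms):
    """<break> with a duration in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'

def say_as(text, interpret_as):
    """<say-as> forcing how digits, dates, etc. are read."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"

def sub(text, alias):
    """<sub> speaking `alias` in place of the written `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"

def speak(*parts):
    """Wrap already-escaped fragments in a <speak> root."""
    return "<speak>" + " ".join(parts) + "</speak>"

ssml = speak(
    escape("Call the R&D desk at"),   # plain text must be XML-escaped
    say_as("555-0134", "telephone"),
    pause(400),
    escape("and ask about"),
    sub("K8s", "Kubernetes"),
)
ET.fromstring(ssml)  # sanity check: raises ParseError if malformed
print(ssml)
```

Escaping every plain-text fragment keeps characters like & and < in user-supplied strings from producing markup the engine rejects.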
Code Example
<speak>
Your verification code is
<say-as interpret-as="characters">A B 4 7</say-as>
<break time="500ms"/>
Please enter it within
<prosody rate="slow">five minutes</prosody>.
</speak>
Common SSML pattern for verification codes: spell out the characters, pause before the instruction, and slow down the deadline.
When To Use
Use SSML for digit-strings (codes, phone numbers, account IDs), brand-name pronunciation, and deliberate pauses. Skip it for emotional control on neural TTS that supports voice settings or natural-language style hints.