SSML (Speech Synthesis Markup Language)
SSML is an XML-based markup standard that lets you control pronunciation, pacing, pauses, emphasis, and voice parameters in TTS output.
SSML is a W3C standard. With it you can insert pauses, control speaking rate, adjust pitch and volume, emphasise words, specify pronunciation for unusual terms (using IPA or another phonetic alphabet), switch voices mid-sentence, and shape intonation, for example with a rising or falling pitch contour. TTS providers including Amazon Polly, Google Cloud TTS, and Microsoft Azure Speech accept SSML input via their APIs, as do many purpose-built platforms, though the set of supported tags varies by provider. SSML is particularly useful where precise pronunciation matters: brand names, technical jargon, foreign-language words, phone numbers, and currency amounts. A simple SSML example: `<speak>Hello <break time="500ms"/> and welcome.</speak>` produces "Hello [half-second pause] and welcome."
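The one-line example above can also be generated from code, which helps when the text comes from users and may contain characters that are special in XML. The following is a minimal Python sketch using only the standard library; the function name `speak_with_pause` is made up for illustration and is not part of any provider's SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def speak_with_pause(before: str, after: str, pause_ms: int) -> str:
    """Return an SSML document with a pause between two text fragments."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    # escape() keeps &, < and > in the input from breaking the XML.
    return (
        f"<speak>{escape(before)} "
        f'<break time="{pause_ms}ms"/> '
        f"{escape(after)}</speak>"
    )


ssml = speak_with_pause("Hello", "and welcome.", 500)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the result before sending it to a TTS API is a cheap way to catch malformed markup early, although it only checks XML well-formedness, not whether a given provider accepts each tag.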
How it works
Core SSML tags include `<break>` for pauses, `<emphasis>` for stressed words, `<prosody>` for rate, pitch, and volume control, `<phoneme>` for custom pronunciation via IPA, `<sub>` for text substitution (useful for pronouncing acronyms), `<say-as>` for telling the engine how to interpret input (phone numbers, dates, ordinals), and `<voice>` for switching voices mid-speech. SSML 1.1 also defines `<lang>` for inline language switching (useful for mixed Hindi-English code-switched content) and `<lexicon>` with `<lookup>` for external pronunciation dictionaries. Not all SSML tags work on all platforms: Google Cloud TTS, for example, supports a slightly different set than Amazon Polly. Preset-based TTS platforms like VoisLabs hide this complexity behind named tone presets (horror, YouTube, devotional, etc.), so creators pick a preset instead of hand-writing SSML.
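Because tag support differs between providers, it can be useful to check a document against a list of supported tags before sending it. The sketch below shows one way to do that in Python. The set `HYPOTHETICAL_SUPPORTED` is invented for illustration and does not describe any real provider; the actual list should come from the provider's documentation.

```python
import xml.etree.ElementTree as ET

# Assumption: a made-up allow-list, NOT the tag set of any real TTS provider.
HYPOTHETICAL_SUPPORTED = {"speak", "break", "emphasis", "prosody", "say-as", "sub"}


def unsupported_tags(ssml: str, supported: set[str]) -> set[str]:
    """Return the tag names used in `ssml` that are not in `supported`."""
    root = ET.fromstring(ssml)
    # iter() walks the root element and every descendant.
    return {element.tag for element in root.iter()} - supported


ssml = (
    '<speak><prosody rate="slow">Call '
    '<say-as interpret-as="telephone">1800123456</say-as></prosody> '
    '<voice name="Priya">now</voice>.</speak>'
)
print(unsupported_tags(ssml, HYPOTHETICAL_SUPPORTED))
```

Here the only tag outside the allow-list is `voice`, so a caller could either drop that element or route the request to a provider that supports it.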
Examples
Natural pauses
`<speak>The Bhagavad Gita <break time="1s"/> is a 700-verse Hindu scripture.</speak>` inserts a one-second dramatic pause before the key information.
Custom pronunciation
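`<phoneme alphabet="ipa" ph="ˈpuːnə">Pune</phoneme>` makes the TTS pronounce the city as "POO-nuh", the usual English form, rather than the incorrect "Pyoon". For the Marathi pronunciation, closer to "Poo-nay", the `ph` value would be roughly "ˈpuɳe", provided the engine supports that symbol.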
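When many words need custom pronunciation, wrapping each one by hand becomes tedious. The sketch below applies a small word-to-IPA dictionary to plain text and emits `<phoneme>` tags. The names `LEXICON`, `phoneme`, and `apply_lexicon` are hypothetical helpers written for this example, and the approach is a plain Python stand-in, not the SSML `<lexicon>` mechanism.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumption: a tiny illustrative dictionary; the IPA string is the one shown above.
LEXICON = {"Pune": "ˈpuːnə"}


def phoneme(text: str, ipa: str) -> str:
    """Wrap `text` in a <phoneme> tag carrying the given IPA string."""
    # quoteattr() adds the surrounding quotes and escapes the value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Return an SSML document with every lexicon word wrapped in <phoneme>."""
    # Splitting on a captured group keeps both the words and the separators.
    parts = re.split(r"(\w+)", text)
    body = "".join(
        phoneme(part, lexicon[part]) if part in lexicon else escape(part)
        for part in parts
    )
    return f"<speak>{body}</speak>"


result = apply_lexicon("Trains to Pune & Mumbai", LEXICON)
ET.fromstring(result)  # raises ParseError if the markup is not well-formed
print(result)
```

Matching is case-sensitive and whole-word only, which avoids rewriting "Pune" inside a longer word but means inflected or lower-cased forms need their own entries.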
Mid-sentence voice switch
`<speak>And now <voice name="Priya">a word from our Hindi expert</voice>, continuing in English.</speak>` switches to a second voice for one phrase. Voice names such as "Priya" are provider-specific.
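Inline language switching with `<lang>` follows the same wrap-a-span pattern as `<voice>`. The sketch below is one possible way to tag Devanagari runs in code-switched text automatically, by matching the Unicode Devanagari block (U+0900 to U+097F). It assumes that Hindi is written in Devanagari and that the target engine accepts `<lang xml:lang="hi-IN">`; romanised Hindi would not be detected, and `tag_hindi_runs` is a name invented for this example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# One or more Devanagari words (U+0900-U+097F), allowing whitespace between them.
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+(?:\s+[\u0900-\u097F]+)*")


def tag_hindi_runs(text: str, lang_code: str = "hi-IN") -> str:
    """Wrap each Devanagari run in a <lang> tag and return an SSML document."""
    escaped = escape(text)  # escape first so inserted tags are not escaped
    tagged = DEVANAGARI_RUN.sub(
        lambda match: f'<lang xml:lang="{lang_code}">{match.group(0)}</lang>',
        escaped,
    )
    return f"<speak>{tagged}</speak>"


mixed = tag_hindi_runs("Today's topic is नमस्ते दुनिया for beginners.")
ET.fromstring(mixed)  # raises ParseError if the markup is not well-formed
print(mixed)
```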
Why this matters for Indian-language TTS
SSML is particularly important in Indian-language TTS for handling Hindi-English code-switching (extremely common in Indian content), pronouncing Sanskrit-derived words correctly, pacing devotional content, and rendering Indian names whose spelling does not reliably predict their pronunciation. VoisLabs supports SSML input via its API, while the web interface uses tone presets to abstract SSML: creators pick "Devotional" or "Horror" instead of writing tags such as `<prosody rate="slow" pitch="-2st">` manually.
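To make the preset idea concrete, here is a minimal Python sketch of how a named preset could expand into a `<prosody>` wrapper. This is an illustration only and not VoisLabs's implementation: the `PRESETS` table and `apply_preset` function are hypothetical, the "devotional" values reuse the tag quoted above, and the "horror" values are invented.

```python
from xml.sax.saxutils import escape

# Assumption: illustrative values only, not any platform's real presets.
PRESETS = {
    "devotional": {"rate": "slow", "pitch": "-2st"},
    "horror": {"rate": "x-slow", "pitch": "-4st", "volume": "soft"},
}


def apply_preset(text: str, preset: str) -> str:
    """Wrap `text` in a <prosody> tag built from the named preset."""
    try:
        attrs = PRESETS[preset.lower()]
    except KeyError:
        raise ValueError(f"unknown preset: {preset!r}") from None
    attr_str = " ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"<speak><prosody {attr_str}>{escape(text)}</prosody></speak>"


print(apply_preset("Om shanti", "Devotional"))
```

A real preset would likely control more than prosody (voice choice, pauses, emphasis), but the lookup-then-wrap shape stays the same.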
Related terms
Text-to-Speech (TTS)
Text-to-speech (TTS) is the technology that converts written text into spoken audio using synthesise…
Neural TTS
Neural TTS uses deep learning to generate speech waveforms directly from text, producing voices that…
Prosody
Prosody is the rhythm, stress, intonation, and pacing patterns of speech — the musical dimension of …
Phoneme
A phoneme is the smallest distinct sound unit in a language that can change word meaning — e.g., the…
Tone Preset
A tone preset is a named configuration that applies a specific emotional style — horror, storytellin…
IPA (International Phonetic Alphabet)
The IPA is a universal notation system that represents every distinct speech sound in every human la…
Last verified: 2026-04-21