Text-to-Speech

SSML (Speech Synthesis Markup Language)

SSML is an XML-based markup standard that lets you control pronunciation, pacing, pauses, emphasis, and voice parameters in TTS output.

Pooja Sharma, Co-founder, VoisLabs

Updated May 2026

SSML (Speech Synthesis Markup Language) is a W3C-standard, XML-based markup language that controls how text-to-speech systems pronounce and render text. With SSML, you can insert pauses, control speaking rate, adjust pitch and volume, emphasise words, specify the pronunciation of unusual terms (using IPA or another phonetic alphabet), switch voices mid-sentence, and shape intonation, such as a rising or falling pitch. A simple SSML example: `<speak>Hello <break time="500ms"/> and welcome.</speak>` produces "Hello [half-second pause] and welcome." SSML is particularly useful where precise pronunciation matters: brand names, technical jargon, foreign-language words, phone numbers, and currency amounts. Major TTS providers, including Amazon Polly, Google Cloud TTS, and Microsoft Azure Speech, as well as many purpose-built platforms, accept SSML input via their APIs, though the set of supported tags varies by provider.
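Because SSML is XML, it can be inspected with any XML parser before it is sent to a TTS engine. The following is a minimal sketch, not part of any provider's SDK: the `describe` helper is a hypothetical name, and it assumes the markup carries no `xmlns` declaration. It renders an SSML string as plain text with each `<break>` shown in brackets, mirroring the "Hello [half-second pause] and welcome." reading above.

```python
import xml.etree.ElementTree as ET


def describe(ssml: str) -> str:
    """Render SSML as plain text, showing each <break> as a bracketed pause.

    Hypothetical helper for illustration. Raises ET.ParseError if the
    markup is not well-formed XML, and ValueError if the root element
    is not <speak>.
    """
    root = ET.fromstring(ssml)
    if root.tag != "speak":
        raise ValueError(f"root element must be <speak>, got <{root.tag}>")

    def walk(el: ET.Element) -> str:
        out = []
        if el.tag == "break":
            # A <break> carries no text of its own, only a duration.
            out.append(f"[pause {el.get('time', 'default')}]")
        else:
            out.append(el.text or "")
            for child in el:
                out.append(walk(child))
        # Text that follows the element's closing tag.
        out.append(el.tail or "")
        return "".join(out)

    return walk(root)


print(describe('<speak>Hello <break time="500ms"/> and welcome.</speak>'))
# Hello [pause 500ms] and welcome.
```

Running a check like this before calling an API catches unclosed tags early, which is usually cheaper than debugging a rejected synthesis request.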

How it works

Core SSML tags include `<break>` for pauses, `<emphasis>` for stressed words, `<prosody>` for rate, pitch, and volume control, `<phoneme>` for custom pronunciation via IPA, `<sub>` for text substitution (useful for spelling out how an acronym or abbreviation should be read), `<say-as>` for telling the engine how to interpret input (phone numbers, dates, ordinals), and `<voice>` for switching voices mid-speech. SSML 1.1 also defines `<lang>` for inline language switching (useful for mixed Hindi-English code-switched content) and `<lexicon>` with `<lookup>` for custom pronunciation dictionaries. Not all SSML tags work on all platforms: Google Cloud TTS, for example, supports a slightly different set than Amazon Polly. Preset-based TTS platforms like VoisLabs hide this complexity behind named tone presets (horror, YouTube, devotional, etc.), so creators pick a preset instead of hand-writing SSML.
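To show how several of these tags combine in one document, here is a minimal sketch that builds SSML with Python's standard `xml.etree.ElementTree` instead of string concatenation. The function name `build_ssml`, the sample sentence, and the placeholder phone number are illustrative assumptions. The `interpret-as="telephone"` value is also an assumption: the W3C spec leaves `interpret-as` values to each provider, so check the provider's reference.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a small SSML document using <say-as>, <sub>, and <prosody>."""
    speak = ET.Element("speak")
    speak.text = "Call "

    # interpret-as values are provider-defined; "telephone" is an assumption.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "telephone"})
    say_as.text = "1234567890"  # placeholder number
    say_as.tail = " to reach our "

    # <sub> reads the alias aloud in place of the written text.
    sub = ET.SubElement(speak, "sub", {"alias": "research and development"})
    sub.text = "R&D"  # ElementTree escapes the ampersand on output
    sub.tail = " team. "

    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "-2st"})
    prosody.text = "Thank you for listening."

    return ET.tostring(speak, encoding="unicode")


print(build_ssml())
```

Building the tree with a library means special characters such as `&` are escaped automatically, which is one of the easier mistakes to make when assembling SSML by hand.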

Examples

Natural pauses

<speak>The Bhagavad Gita <break time="1s"/> is a 700-verse Hindu scripture.</speak> — creates a dramatic pause before key information.
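When pauses need to be added programmatically, a small helper can join text fragments with `<break>` tags. This is a sketch under stated assumptions: `join_with_breaks` is a hypothetical name, and the duration check only accepts the plain `ms` and `s` forms.

```python
import re
from xml.sax.saxutils import escape

# Accepts durations such as "500ms", "1s", or "1.5s".
_DURATION = re.compile(r"\d+(\.\d+)?(ms|s)")


def join_with_breaks(parts: list[str], pause: str = "1s") -> str:
    """Join text fragments into one <speak> document separated by pauses."""
    if not _DURATION.fullmatch(pause):
        raise ValueError(f"unsupported pause duration: {pause!r}")
    separator = f' <break time="{pause}"/> '
    # Escape each fragment so characters like & and < stay valid XML.
    body = separator.join(escape(part.strip()) for part in parts)
    return f"<speak>{body}</speak>"


print(join_with_breaks(["The Bhagavad Gita", "is a 700-verse Hindu scripture."]))
# <speak>The Bhagavad Gita <break time="1s"/> is a 700-verse Hindu scripture.</speak>
```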

Custom pronunciation

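<phoneme alphabet="ipa" ph="ˈpuːnə">Pune</phoneme> makes the TTS pronounce the city as "POO-nuh" (the sound this IPA string encodes) rather than the incorrect "Pyoon". For a final "-nay" sound, closer to the Marathi pronunciation, the ph value would end in "e" rather than "ə".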
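For scripts with many unusual names, the same tag can be applied from a small pronunciation table. The sketch below assumes a hypothetical helper, `apply_pronunciations`, and a caller-supplied mapping of words to IPA strings. It matches whole words only and is case-sensitive.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme(text: str, ipa: str) -> str:
    """Wrap text in a <phoneme> tag with an IPA pronunciation."""
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'


def apply_pronunciations(text: str, ipa_by_word: dict[str, str]) -> str:
    """Wrap every listed word in <phoneme> and return a <speak> document."""
    if not ipa_by_word:
        return f"<speak>{escape(text)}</speak>"
    # Longest words first, so a longer entry wins over a shorter prefix.
    words = sorted(ipa_by_word, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    pieces, pos = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[pos:match.start()]))  # untouched text
        pieces.append(phoneme(match.group(0), ipa_by_word[match.group(0)]))
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "<speak>" + "".join(pieces) + "</speak>"


print(apply_pronunciations("Trains to Pune & Mumbai", {"Pune": "ˈpuːnə"}))
# <speak>Trains to <phoneme alphabet="ipa" ph="ˈpuːnə">Pune</phoneme> &amp; Mumbai</speak>
```

Keeping the table outside the script text means a pronunciation fix is made once and applies to every occurrence.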

Mid-sentence voice switch

<speak>And now <voice name="Priya">a word from our Hindi expert</voice> — continuing in English.</speak>

Why this matters for Indian-language TTS

SSML is particularly important in Indian-language TTS for handling Hindi-English code-switching (extremely common in Indian content), proper pronunciation of Sanskrit-derived words, devotional content pacing, and phonetically irregular Indian names. VoisLabs supports SSML input via its API, while the web interface uses tone presets to abstract SSML — creators pick "Devotional" or "Horror" instead of writing `<prosody rate="slow" pitch="-2st">` tags manually.
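As an illustration of the code-switching case, the following sketch wraps runs of Devanagari text in `<lang xml:lang="hi-IN">`. It rests on a simplifying assumption: script detection stands in for language detection, so it will not catch romanised Hindi and will also tag other Devanagari-script languages such as Marathi or Sanskrit. The function name `tag_hindi_runs` is hypothetical, and `<lang>` support should be confirmed for the target provider.

```python
import re
from xml.sax.saxutils import escape

# One or more Devanagari words, optionally separated by whitespace.
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+(?:\s+[\u0900-\u097F]+)*")


def tag_hindi_runs(text: str, lang: str = "hi-IN") -> str:
    """Wrap each Devanagari run in a <lang> tag inside a <speak> document."""
    pieces, pos = [], 0
    for match in DEVANAGARI_RUN.finditer(text):
        pieces.append(escape(text[pos:match.start()]))
        pieces.append(f'<lang xml:lang="{lang}">{escape(match.group(0))}</lang>')
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "<speak>" + "".join(pieces) + "</speak>"


print(tag_hindi_runs("Welcome back, दोस्तों, to the channel."))
# <speak>Welcome back, <lang xml:lang="hi-IN">दोस्तों</lang>, to the channel.</speak>
```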

Related terms

Text-to-Speech (TTS)

Text-to-speech (TTS) is the technology that converts written text into spoken audio using synthesise…

Neural TTS

Neural TTS uses deep learning to generate speech waveforms directly from text, producing voices that…

Prosody

Prosody is the rhythm, stress, intonation, and pacing patterns of speech — the musical dimension of …

Phoneme

A phoneme is the smallest distinct sound unit in a language that can change word meaning — e.g., the…

Tone Preset

A tone preset is a named configuration that applies a specific emotional style — horror, storytellin…

IPA (International Phonetic Alphabet)

The IPA is a universal notation system that represents every distinct speech sound in every human la…

Learn more

VoisLabs TTS API Tone Presets

Frequently Asked Questions

Do I need to learn SSML to use VoisLabs?

No — the web interface uses tone presets that map to SSML under the hood. SSML is only needed if you integrate with the VoisLabs API and want fine-grained control over pauses, rate, or pronunciation beyond what the preset system provides.

Is SSML the same across all TTS providers?

Partially. The W3C SSML 1.1 spec defines the core tag set, but providers extend it differently. Amazon Polly supports tags Google Cloud doesn't, and vice versa. Always check the provider's SSML reference before porting scripts between platforms.
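One way to approach porting is to list the tags a script uses and compare them against the set the target provider documents. The sketch below assumes a hypothetical helper, `unsupported_tags`. The `supported` set in the example is a placeholder and does not describe any real provider, so fill it in from the provider's SSML reference.

```python
import xml.etree.ElementTree as ET


def unsupported_tags(ssml: str, supported: set[str]) -> set[str]:
    """Return the tags used in an SSML document that are not in `supported`."""
    root = ET.fromstring(ssml)
    used = {element.tag for element in root.iter()}
    return used - supported


script = '<speak>And now <voice name="Priya">a word</voice> in English.</speak>'

# Placeholder set for illustration only, not a real provider's tag list.
supported = {"speak", "break", "prosody"}

print(unsupported_tags(script, supported))
# {'voice'}
```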

What's the difference between SSML and tone presets?

SSML is low-level markup requiring you to specify exact parameters (rate="slow", pitch="-2st"). Tone presets are high-level named configurations (Horror, Devotional, YouTube Commentary) that apply a tuned combination of those parameters automatically. Presets are faster for creators; SSML is more flexible for developers.
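Conceptually, a preset can be pictured as a lookup from a name to a set of `<prosody>` parameters. The sketch below is purely illustrative: the `PRESETS` table, its values, and `apply_preset` are assumptions made for this example and do not describe how VoisLabs implements or tunes its presets.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical presets with illustrative values, not any vendor's real tuning.
PRESETS = {
    "devotional": {"rate": "slow", "pitch": "-2st"},
    "horror": {"rate": "x-slow", "pitch": "-4st"},
}


def apply_preset(text: str, preset: str) -> str:
    """Wrap text in the <prosody> settings associated with a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    attrs = " ".join(
        f"{name}={quoteattr(value)}" for name, value in PRESETS[preset].items()
    )
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"


print(apply_preset("Om Shanti", "devotional"))
# <speak><prosody rate="slow" pitch="-2st">Om Shanti</prosody></speak>
```

The creator only chooses a name, while the developer-facing SSML stays available for cases the table does not cover.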

Last verified: 2026-04-21