Diacritic

A diacritic is a mark added to a base letter to modify its pronunciation, typically indicating accent, tone, length, or nasalisation.

VoisLabs Team · Updated March 2026

A diacritic (also called a diacritical mark or, loosely, an accent) is a symbol added to a base letter to modify its pronunciation. Diacritics appear across many writing systems: French uses accents (é, è, ê), German uses the umlaut (ä, ö, ü), and Spanish uses the tilde (ñ). Indic scripts use matras (vowel signs), anusvara (nasalisation), visarga (aspiration), and virama/halant (vowel-cancelling). Urdu, usually written in the Nastaliq style, uses hamza (ء) and other Perso-Arabic marks.

Arabic and Urdu also use tashkeel (marks indicating short vowels), which are usually omitted in everyday writing but included in sacred texts, beginner materials, and TTS-optimised input. In TTS, diacritics carry pronunciation information: a system reading Arabic or Urdu text without tashkeel must infer the short vowels from context, whereas with tashkeel the short vowels are spelled out.

Proper diacritic rendering requires font support for the specific marks and positioning logic for combining them with base letters. Unicode encodes most diacritics as combining characters that follow the base letter, and the base letter plus its marks form a single grapheme cluster. Many Latin letters, such as é, also exist as single precomposed code points.
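The encoding described above can be inspected directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `code_points` is chosen for this illustration and is not part of any library. It shows that a Latin letter such as é can exist either as one precomposed code point or as a base letter plus a combining mark, while a Devanagari vowel sign is always stored as a separate code point after its consonant.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return each code point as 'U+XXXX NAME' for inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# Latin: the same visible letter can be one code point or two.
precomposed = "\u00e9"   # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings compare unequal until they are normalised to the same form.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# Devanagari: the vowel sign is a separate code point after the consonant.
for line in code_points("\u0915\u093f"):   # कि
    print(line)
```

Normalising all input to one form (NFC is a common choice) before comparing or looking up words avoids treating visually identical strings as different.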

How it works

Diacritics fall into categories by function: vowel-indicating (French é, Indic matras, Arabic tashkeel fatha/kasra/damma), tone-indicating (Mandarin Pinyin marks, some African languages), nasalisation (Indic anusvara ं, Portuguese ã), length-indicating (the macron in ā marks a long vowel in some languages), stress-marking (Spanish acute accent on stressed syllables), and cancellation (Indic virama/halant, indicating that the consonant has no following vowel). The Spanish tilde in ñ is a different case: it marks a distinct palatal nasal consonant rather than nasalisation of a vowel.

In Indic scripts specifically, diacritic marks include anusvara (ं, nasalisation), visarga (ः, voiceless aspiration, typically at word end), chandrabindu (ँ, nasalisation of the vowel it marks), virama/halant (्, consonant-only indicator used in conjuncts), and various matras.

Urdu uses Arabic-derived diacritics: fatha, kasra, damma (short vowels), shadda (consonant doubling), sukun (no vowel), and tanwin (indefinite noun markers). Rendering diacritics correctly requires fonts with the appropriate marks and shaping rules that position them relative to base letters. This is especially complex in Nastaliq, where the shape of each base letter depends on its neighbours.
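Unicode does not record the linguistic function of a mark, but it does record a general category that matters for rendering: nonspacing marks (Mn) attach above or below the base letter, while spacing marks (Mc) occupy their own horizontal space. The sketch below looks up that category for several of the marks named above. The dictionary labels and the `mark_kind` helper are names chosen for this example only.

```python
import unicodedata

# Marks mentioned in the text, keyed by labels chosen for this sketch.
MARKS = {
    "anusvara": "\u0902",
    "visarga": "\u0903",
    "chandrabindu": "\u0901",
    "virama": "\u094d",
    "matra_i": "\u093f",
    "fatha": "\u064e",
    "shadda": "\u0651",
    "sukun": "\u0652",
}

def mark_kind(ch: str) -> str:
    """Map a character's Unicode general category to a readable label."""
    category = unicodedata.category(ch)
    if category == "Mn":
        return "nonspacing mark"
    if category == "Mc":
        return "spacing mark"
    if category == "Me":
        return "enclosing mark"
    return "not a mark"

for label, ch in MARKS.items():
    print(f"{label:13} U+{ord(ch):04X} {mark_kind(ch)}")
```

A pipeline that needs the functional categories (vowel, tone, nasalisation, and so on) would have to maintain its own per-script table, since the general category alone does not distinguish them.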

Examples

Indic anusvara

हिन्दी vs हिंदी (both "Hindi"): in the second spelling the anusvara (ं) replaces the half-form न् and stands for the nasal consonant before द, so the two spellings are pronounced the same. Both appear in modern usage; the anusvara form is more common.

Arabic tashkeel

Arabic كَتَبَ (kataba, "he wrote") vs كتب (ktb, unvowelled). The tashkeel marks show the /a/ vowels explicitly; unvowelled text requires the reader to infer vowels from context.
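The relationship between the two forms can be shown in a few lines. This is a minimal sketch that assumes, for illustration, that "tashkeel" means the eight marks in the range U+064B to U+0652 (tanwin, fatha, damma, kasra, shadda, sukun); other Arabic marks such as the superscript alef (U+0670) are deliberately left out. The function names are hypothetical.

```python
# Assumption for this sketch: tashkeel = U+064B..U+0652 inclusive.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_tashkeel(text: str) -> str:
    """Remove short-vowel and related marks, leaving the consonant skeleton."""
    return "".join(ch for ch in text if ch not in TASHKEEL)

def has_tashkeel(text: str) -> bool:
    """Report whether the text carries any of the marks in TASHKEEL."""
    return any(ch in TASHKEEL for ch in text)

vowelled = "\u0643\u064e\u062a\u064e\u0628\u064e"   # kataba, fully vowelled
unvowelled = "\u0643\u062a\u0628"                   # ktb, consonants only

assert strip_tashkeel(vowelled) == unvowelled
assert has_tashkeel(vowelled)
assert not has_tashkeel(unvowelled)
```

Stripping is a lossy, one-way operation: going from the unvowelled form back to the vowelled one requires linguistic context, which is why supplying tashkeel in the input removes guesswork.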

Gurmukhi addak

ਪੱਕਾ (pakkā, "firm") uses the addak (ੱ) to indicate that the following consonant is doubled. Without it, the word would read as ਪਕਾ (pakā), with a different pronunciation and meaning.

Why this matters for Indian-language TTS

Diacritics are central to Indian-language TTS accuracy. Hindi's distinction between anusvara and chandrabindu, Malayalam's chandrakkala (virama), Urdu's tashkeel (for religious and formal text), and Gurmukhi's addak all directly affect TTS pronunciation. A TTS system that silently drops or mishandles diacritics produces noticeably wrong audio. VoisLabs' Indic input pipeline preserves all diacritic marks and uses them in pronunciation decisions.
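To make the failure mode concrete, the sketch below shows how two clean-up steps that are common in Latin-oriented text processing damage Indic input. It is an illustration of the problem, not a description of any particular product's pipeline, and both function names are hypothetical.

```python
import unicodedata

def naive_strip_accents(text: str) -> str:
    """Latin-oriented clean-up: decompose, then drop every nonspacing mark (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def ascii_only(text: str) -> str:
    """Blunter clean-up: discard everything outside ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

hindi = "\u0939\u093f\u0902\u0926\u0940"   # हिंदी
punjabi = "\u0a2a\u0a71\u0a15\u0a3e"       # ਪੱਕਾ

# The accent stripper does what was intended on Latin text...
assert naive_strip_accents("caf\u00e9") == "cafe"
# ...but it removes the anusvara from the Hindi word
assert naive_strip_accents(hindi) == "\u0939\u093f\u0926\u0940"
# ...and the addak from the Punjabi word, producing ਪਕਾ.
assert naive_strip_accents(punjabi) == "\u0a2a\u0a15\u0a3e"
# The ASCII filter removes the Indic text entirely.
assert ascii_only(hindi) == ""
```

Neither function raises an error, which is why this kind of damage tends to go unnoticed until the audio is heard.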

Frequently Asked Questions

Do I need to include diacritics in TTS input?

For Hindi and most Indian languages using Devanagari, yes: standard written Hindi includes anusvara and chandrabindu, and TTS uses them. For Urdu, casual text typically omits tashkeel; TTS quality improves if you add tashkeel for ambiguous words. VoisLabs handles both cases but produces more accurate Urdu with tashkeel included.

How are diacritics stored in Unicode?

As separate code points combining with the base letter. The string "कि" (ki) is stored as क (U+0915) + ि (U+093F), two code points forming one grapheme cluster. Software must handle grapheme clusters correctly to avoid cutting between base and diacritic.

Why do some tools drop diacritics in Indian-language output?

Tools built for Latin scripts may pre-process text to strip "unknown" characters, including Indic diacritics. This silently breaks Indian-language content. Quality Indic TTS preserves all diacritics through the pipeline, both in input text handling and in subtitle output.
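The point about not cutting between a base letter and its diacritic can be sketched as follows. This is a deliberately simplified approximation that attaches every mark (general category Mn, Mc, or Me) to the character before it. It is not a full implementation of Unicode text segmentation: for example, it splits conjuncts after the virama and ignores joiner characters, so production code would more likely use a library that implements the full rules. The function names are chosen for this example.

```python
import unicodedata

def simple_clusters(text: str) -> list[str]:
    """Group each base character with the marks (category M*) that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def truncate_clusters(text: str, limit: int) -> str:
    """Keep at most `limit` clusters, never splitting a base from its marks."""
    return "".join(simple_clusters(text)[:limit])

word = "\u0939\u093f\u0902\u0926\u0940"   # हिंदी: five code points, two clusters

assert simple_clusters(word) == ["\u0939\u093f\u0902", "\u0926\u0940"]
# Slicing by code point cuts inside the first cluster and loses the anusvara.
assert word[:2] == "\u0939\u093f"
# Truncating by cluster keeps the first syllable intact.
assert truncate_clusters(word, 1) == "\u0939\u093f\u0902"
```

The same consideration applies to anything that splits text by position, such as subtitle line breaking or character-count limits.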

Last verified: 2026-04-21