40 terms · Text-to-speech, Indian scripts, audio, video

TTS, Audio & Video Glossary

Plain-English definitions of 40 terms you'll encounter in Indian-language TTS, video creation, and audio production

From Devanagari and sandhi to bitrate and karaoke subtitles — a reference written for creators, developers, agencies, and anyone working with Indian-language voice and video. Each term includes an answer-first definition, concrete examples, why it matters in Indian-language contexts, and 3 FAQs. Updated April 2026.

VoisLabs Team · Updated March 2026

Text-to-Speech

10 terms

Text-to-Speech (TTS)

Text-to-speech (TTS) is the technology that converts written text into spoken audio using synthesised or AI-generated voices.

Neural TTS

Neural TTS uses deep learning models to generate speech from text, producing voices that can sound close to human recordings, especially on short, well-punctuated passages.

SSML (Speech Synthesis Markup Language)

SSML is an XML-based markup standard that lets you control pronunciation, pacing, pauses, emphasis, and voice parameters in TTS output.
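
As a rough illustration of the markup described above, the sketch below builds a small SSML document with Python's standard library. The helper name `build_ssml` and its parameters are made up for this example, and which tags and attribute values a given TTS engine honours varies, so treat the output as a starting point rather than something every engine will accept.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(sentences, pause_ms=400, rate="medium"):
    """Wrap sentences in <speak>/<prosody> and put a <break> between them.

    Hypothetical helper for illustration only; not tied to any TTS vendor.
    """
    speak = ET.Element("speak", {"version": "1.1", "xmlns": SSML_NS})
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    for i, sentence in enumerate(sentences):
        s = ET.SubElement(prosody, "s")
        s.text = sentence
        # Insert a pause after every sentence except the last one.
        if i < len(sentences) - 1:
            ET.SubElement(prosody, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml(["नमस्ते।", "आप कैसे हैं?"], pause_ms=500, rate="slow")
print(ssml)
```

Building the markup with an XML library rather than string concatenation means characters such as `&` or `<` in the input text are escaped automatically.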

Voice Cloning

Voice cloning is AI-based synthesis of a target person's voice from a short audio sample, producing a digital replica that can read any text.

Tone Preset

A tone preset is a named configuration that applies a specific emotional style — horror, storytelling, devotional, etc. — to TTS output.

Prosody

Prosody is the rhythm, stress, intonation, and pacing patterns of speech — the musical dimension of spoken language beyond individual phonemes.

Phoneme

A phoneme is the smallest distinct sound unit in a language that can change word meaning — e.g., the /p/ vs /b/ in "pat" vs "bat".

IPA (International Phonetic Alphabet)

The IPA is a standardised notation system that aims to give each distinct speech sound found in human languages its own symbol.

Speech Synthesis

Speech synthesis is the umbrella term for artificially producing human speech — includes text-to-speech, voice cloning, and voice conversion.

TTS API

A TTS API is a programmatic interface that lets developers convert text to speech audio via HTTP requests for apps, bots, and automation.
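
To make the idea of "text to speech via HTTP" concrete, here is a minimal sketch that only constructs a request and does not send it. The endpoint URL, the JSON field names (`text`, `voice`, `format`), the voice ID, and the bearer-token header are all assumptions for illustration; a real provider's documentation defines the actual contract.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL from your provider's docs.
EXAMPLE_URL = "https://api.example.com/v1/tts"

def build_tts_request(text, voice, api_key, url=EXAMPLE_URL):
    """Build (but do not send) a POST request carrying the text to synthesise."""
    payload = json.dumps({"text": text, "voice": voice, "format": "mp3"})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )

req = build_tts_request("வணக்கம்", voice="ta-IN-example", api_key="YOUR_KEY")
print(req.get_method(), req.full_url)

# Sending is left commented out because the endpoint above does not exist:
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```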

Scripts & Linguistics

10 terms

Devanagari

Devanagari (देवनागरी) is the script used to write Hindi, Marathi, Nepali, Sanskrit, and several other languages of India and Nepal.

Tamil Script

The Tamil script (தமிழ் எழுத்து) is a Brahmi-derived abugida used to write Tamil, one of the oldest classical languages of India.

Malayalam Script

The Malayalam script (മലയാളം ലിപി) is a Brahmi-derived writing system used for Malayalam, the classical language of Kerala.

Gurmukhi

Gurmukhi (ਗੁਰਮੁਖੀ) is the script used to write Punjabi in India, standardised in the 16th century by Guru Angad and used for the Guru Granth Sahib and other Sikh religious texts.

Nastaliq

Nastaliq (نستعلیق) is the Perso-Arabic calligraphic style used to write Urdu, with flowing diagonal strokes and complex ligatures.

Sandhi

Sandhi is the set of sound changes that occur where words or word parts meet, with adjacent sounds merging or modifying each other. Handling it correctly is critical for natural pronunciation in Indian languages.

Conjunct Consonant

A conjunct consonant is a cluster of two or more consonants with no vowel between them, usually written in Indic scripts as a single combined glyph. Rendering it correctly depends on font and text-shaping support.
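
In Unicode text, a Devanagari conjunct is not stored as one character; it is typically a sequence of consonants joined by the virama (U+094D), which the font then draws as one glyph. The sketch below shows this for क + ष. The helper name `make_conjunct` is invented for the example.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA: suppresses the inherent vowel

def make_conjunct(*consonants):
    """Join consonants with the virama so a shaping engine can form a conjunct."""
    return VIRAMA.join(consonants)

ksha = make_conjunct("\u0915", "\u0937")  # क + ष, usually rendered as क्ष

# One visible glyph on screen, but several code points in the string.
for ch in ksha:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
print("code points:", len(ksha))
```

This is why subtitle renderers that process text one code point at a time can break conjuncts apart: the combined shape only appears after shaping.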

Matra

A matra is a dependent vowel sign in Indic scripts that attaches to a consonant to indicate the vowel that follows it.
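
A quick way to see matras in stored text is to look for characters whose Unicode name contains "VOWEL SIGN". The sketch below does that with Python's `unicodedata` module; the function name `find_matras` is made up for this example, and the name-based check is a simple heuristic rather than a full script analysis.

```python
import unicodedata

def find_matras(text):
    """Return the dependent vowel signs in text, in order of appearance."""
    return [ch for ch in text if "VOWEL SIGN" in unicodedata.name(ch, "")]

word = "\u0915\u093F\u0924\u093E\u092C"  # किताब = क + ि + त + ा + ब
for ch in find_matras(word):
    # Matras are combining marks, so their general category starts with "M".
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  {unicodedata.category(ch)}")
```

Note that ि is stored after क in memory even though it is drawn to the left of it, which is one reason Indic text needs a shaping step before display.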

Diacritic

A diacritic is a mark added to a base letter to modify its pronunciation, typically indicating accent, tone, length, or nasalisation.

Text Shaping

Text shaping is the process of converting a sequence of Unicode characters into positioned glyphs for display, handling ligatures and complex scripts.

Video & Captions

10 terms

Karaoke Subtitles

Karaoke subtitles highlight each word or syllable as it is spoken, similar to how song lyrics appear on karaoke screens.
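
The core of word-level highlighting is a lookup from playback time to the word being spoken. The sketch below assumes you already have a start time per word (for example from a TTS engine or a forced aligner); the sample timings and function names are invented for illustration.

```python
from bisect import bisect_right

def active_word_index(word_starts, t):
    """Index of the word being spoken at time t (seconds), or None before the first word.

    word_starts must be sorted in ascending order.
    """
    i = bisect_right(word_starts, t) - 1
    return i if i >= 0 else None

def render_line(words, word_starts, t):
    """Return the caption line with the active word wrapped in [brackets]."""
    idx = active_word_index(word_starts, t)
    return " ".join(f"[{w}]" if i == idx else w for i, w in enumerate(words))

words = ["नमस्ते", "दोस्तों", "स्वागत", "है"]
starts = [0.0, 0.6, 1.3, 1.9]  # made-up timings in seconds

print(render_line(words, starts, 0.7))
```

A real renderer would replace the brackets with a colour or scale change, but the time-to-word lookup stays the same.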

Captions

Captions are time-synchronised text displayed on video to represent spoken dialogue, sound effects, or speaker identification.

SRT File (SubRip Subtitle)

An SRT file is a simple text format for time-coded subtitles, widely supported across video editors, players, and streaming platforms.
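
An SRT cue is a sequence number, a `start --> end` line with `HH:MM:SS,mmm` timestamps, and the caption text, with a blank line between cues. The sketch below writes that layout from a list of `(start, end, text)` tuples; the function names are chosen for this example.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)

print(to_srt([(0.0, 1.2, "नमस्ते"), (1.2, 2.5, "आपका स्वागत है")]))
```

When saving the result to disk, writing it as UTF-8 keeps Indian-script captions intact in most editors and players.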

Closed Captions

Closed captions are captions stored in a separate track that viewers can toggle on or off, supporting accessibility and multi-language viewing.

Burned-in Subtitles

Burned-in subtitles are permanently rendered into the video image, so they are always visible and viewers cannot turn them off.

Aspect Ratio

Aspect ratio is the proportional relationship between video width and height — 9:16 for Shorts, 16:9 for standard YouTube, 1:1 for square posts.
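
The arithmetic behind aspect ratios is small enough to sketch: reduce width and height by their greatest common divisor to get the ratio, or scale one side to find the other. The function names below are illustrative.

```python
from math import gcd

def aspect_ratio(width, height):
    """Reduce pixel dimensions to the simplest width:height ratio."""
    d = gcd(width, height)
    return f"{width // d}:{height // d}"

def height_for(width, ratio_w, ratio_h):
    """Height in pixels that gives ratio_w:ratio_h at the given width."""
    return round(width * ratio_h / ratio_w)

print(aspect_ratio(1080, 1920))   # vertical video
print(aspect_ratio(1920, 1080))   # landscape video
print(height_for(1080, 9, 16))    # height of a 9:16 frame that is 1080 px wide
```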

Audio Visualizer

An audio visualizer converts an audio waveform into an animated visual — commonly used for podcast clips, music tracks, and audiogram videos.

Audiogram

An audiogram is a short audio clip presented as a video, typically with an audio waveform, speaker image, and captions — used to promote podcasts on social media.

Faceless YouTube Channel

A faceless YouTube channel produces videos without showing the creator on camera — using AI voice or voiceover, stock footage, and captions.

Video Automation Pipeline

A video automation pipeline is a workflow that produces finished videos from input text or audio with minimal manual editing, typically using AI TTS and stock visuals.

Audio Formats

10 terms

MP3

MP3 is a lossy audio compression format that produces small files with good audio quality — the de facto standard for podcast and music distribution.

WAV

WAV (Waveform Audio File Format) is an audio container developed by Microsoft and IBM that typically stores uncompressed PCM audio, preserving full fidelity without compression artifacts.
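
To show what "uncompressed" means in practice, the sketch below writes a short sine tone as 16-bit mono PCM using Python's built-in `wave` module, then reads the header back. The tone frequency, duration, and function name are arbitrary choices for the example.

```python
import io
import math
import struct
import wave

def tone_wav_bytes(freq_hz=440.0, seconds=0.5, sample_rate=48_000):
    """Return a mono, 16-bit PCM WAV file (as bytes) containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    amplitude = 0.3 * 32767  # stay well below the 16-bit maximum
    frames = b"".join(
        struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 2 bytes = 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(frames)
    return buf.getvalue()

data = tone_wav_bytes()
with wave.open(io.BytesIO(data), "rb") as r:
    print(r.getnchannels(), r.getsampwidth(), r.getframerate(), r.getnframes())
```

Every sample is stored as-is, so file size grows linearly with duration, sample rate, bit depth, and channel count.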

AAC (Advanced Audio Coding)

AAC is a lossy audio codec that generally delivers better audio quality than MP3 at the same bitrate. It is the default audio format on Apple devices and is widely used by YouTube.

M4A

M4A is an MPEG-4 container format for audio files, typically using AAC compression — the default format for iPhone Voice Memos and iTunes.

Bitrate

Bitrate is the amount of data used per second of audio, measured in kilobits per second (kbps). Within the same codec, a higher bitrate generally means better quality and larger files.
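
Because bitrate is data per second, file size can be estimated as bitrate times duration. The sketch below assumes constant bitrate, 1 kbps = 1000 bits per second, and ignores container and metadata overhead, so real files will differ slightly.

```python
def estimated_size_bytes(bitrate_kbps, duration_s):
    """Approximate audio size in bytes: kilobits/s * 1000 * seconds / 8 bits per byte."""
    return bitrate_kbps * 1000 * duration_s / 8

# One minute of audio at two common MP3 bitrates.
for kbps in (128, 320):
    size = estimated_size_bytes(kbps, 60)
    print(f"{kbps} kbps for 60 s is about {size / 1_000_000:.2f} MB")
```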

Sample Rate

Sample rate is how many times per second the audio signal is sampled, measured in hertz. 44.1 kHz is the CD standard, and 48 kHz is the video-production standard.
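
Two small calculations follow from the sample rate: the highest frequency that can be captured (half the sample rate, per the Nyquist limit) and the data rate of uncompressed PCM audio. The function names and the 16-bit default below are assumptions made for this sketch.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at this sample rate (half the rate)."""
    return sample_rate_hz / 2

def pcm_bitrate_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Uncompressed PCM data rate: samples/s * bits per sample * channels."""
    return sample_rate_hz * bit_depth * channels / 1000

print(nyquist_hz(48_000))                      # upper frequency limit at 48 kHz
print(pcm_bitrate_kbps(44_100, 16, 2))         # CD-style stereo audio
print(pcm_bitrate_kbps(48_000, 16, 1))         # mono voice track for video
```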

Mono vs Stereo

Mono is single-channel audio; stereo is two-channel (left + right) audio with directional information. Voice is typically mono; music is typically stereo.

Podcast

A podcast is an episodic audio program distributed via RSS feed, typically downloadable or streamable on podcast apps like Spotify, Apple Podcasts, and JioSaavn.

Voiceover

Voiceover is spoken narration added to video, animation, or audio content — can be human-recorded or AI-generated, used for explainers, ads, and storytelling.

Dubbing

Dubbing is replacing the original audio track of a video (typically dialogue) with translated or re-recorded voice in another language.
