Building for Indian Languages
The concepts that make Indian-language speech and document apps work in production — code-mixing, scripts, telephony audio, voices, and Indic OCR.
Indian-language products have to handle realities that English-first stacks ignore: people mix English into every sentence, the same language is written in multiple scripts, a lot of audio arrives over 8kHz phone lines, and documents come as scans in regional scripts. This page maps those realities to the exact Sarvam speech, voice, and vision APIs — with code that’s been run against the live API.
Coverage differs by capability, so pick the model that matches your language set.
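saaras:v3 exposes a mode parameter that controls how speech is written down. This is where most Indic-specific decisions happen. The modes that matter most for code-mixed speech are transcribe, codemix, and translit.
The difference is easiest to see on one clip. For audio of “मुझे flight book करनी है” (a typical Hinglish sentence), each mode writes the same audio differently: transcribe produces a clean native-script record, codemix keeps the natural mix of languages as spoken, and translit writes the sentence in Latin script.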
Use codemix for chat/agent transcripts that feel natural, transcribe for clean native-script records, and translit when a downstream system only handles Latin script. The REST endpoint handles clips up to 30 seconds — for longer audio use the Batch API.
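The guidance above can be sketched as a small planning helper that picks a mode for a use case and decides between the REST endpoint and the Batch API from the clip length. This is an illustrative sketch only: the use-case labels, the function names, and the returned dictionary shape are assumptions made for this example, not part of the Sarvam SDK, and no API call is made.

```python
import io
import wave

REST_MAX_SECONDS = 30.0  # REST limit stated in the text above

# Hypothetical use-case labels mapped to the saaras:v3 modes named above.
MODE_FOR_USE_CASE = {
    "chat_transcript": "codemix",           # natural-feeling agent/chat logs
    "native_record": "transcribe",          # clean native-script records
    "latin_only_downstream": "translit",    # downstream handles Latin only
}


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of an in-memory WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def plan_transcription(wav_bytes: bytes, use_case: str) -> dict:
    """Choose a mode and a route (REST vs Batch) without calling any API."""
    duration = wav_duration_seconds(wav_bytes)
    return {
        "model": "saaras:v3",
        "mode": MODE_FOR_USE_CASE[use_case],
        "route": "rest" if duration <= REST_MAX_SECONDS else "batch",
        "duration_seconds": duration,
    }


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> bytes:
    """Build a silent 16-bit mono WAV clip, handy for trying the helper."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()
```

For example, `plan_transcription(make_silent_wav(5), "chat_transcript")` selects `codemix` over REST, while a 45-second clip is routed to Batch.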
A large share of Indian voice traffic is phone audio: 8kHz, mono, often µ-law/A-law encoded. Two rules keep transcription quality high:
- Set sample_rate=8000 both when opening the streaming connection and when sending each audio chunk. Mismatched rates cause garbled output.
- Send 8kHz audio as raw PCM (pcm_s16le, pcm_l16, pcm_raw) only.

saaras:v3 is tuned for telephony, so prefer it for call audio.
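If your telephony provider hands you µ-law bytes, one way to satisfy both rules is to decode them to 16-bit little-endian PCM before sending. The sketch below uses the standard G.711 µ-law expansion in pure Python. The message builders at the end are assumptions for illustration: the field names (`input_audio_codec`, `audio`, `data`, `encoding`) are hypothetical and should be checked against the Streaming STT guide.

```python
import base64
import struct

ULAW_BIAS = 0x84               # bias used by G.711 mu-law companding
TELEPHONY_SAMPLE_RATE = 8000   # phone audio rate discussed above


def ulaw_byte_to_pcm16(value: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~value & 0xFF                      # mu-law bytes are stored inverted
    mantissa = u & 0x0F
    exponent = (u & 0x70) >> 4
    magnitude = ((mantissa << 3) + ULAW_BIAS) << exponent
    return ULAW_BIAS - magnitude if u & 0x80 else magnitude - ULAW_BIAS


def ulaw_to_pcm_s16le(ulaw: bytes) -> bytes:
    """Decode a mu-law buffer to raw 16-bit little-endian PCM."""
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


def connection_options() -> dict:
    """Hypothetical connection settings; key names are assumptions."""
    return {
        "model": "saaras:v3",
        "sample_rate": TELEPHONY_SAMPLE_RATE,
        "input_audio_codec": "pcm_s16le",
    }


def audio_chunk_message(pcm: bytes) -> dict:
    """Hypothetical per-chunk message; repeats the same 8000 Hz rate."""
    return {
        "audio": {
            "data": base64.b64encode(pcm).decode("ascii"),
            "sample_rate": TELEPHONY_SAMPLE_RATE,
            "encoding": "pcm_s16le",
        }
    }
```

The point of the two builders is only that the same `sample_rate` value appears in both places, so the rate declared at connect time and the rate on each chunk cannot drift apart.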
See the Streaming STT guide for the full WebSocket lifecycle, VAD, reconnection, and voice-agent barge-in.
bulbul:v3 speaks 11 languages (10 Indian + English) with 30+ voices, and it’s built for the way Indians actually write — so you usually don’t pre-process anything.
- Set enable_preprocessing=true when your text has lots of abbreviations or digits.
- Choose a voice with speaker (e.g. shubh, priya, kavya); see the voice list.
- output_audio_codec supports mulaw and alaw alongside mp3, wav, opus, etc.

Brand names, abbreviations, and regional terms (“NEFT”, “KYC”, a company name) don’t always come out right by default. Create a Pronunciation Dictionary with Bulbul v3 to pin how specific words are spoken, then pass its dict_id to any TTS call (REST, HTTP stream, or WebSocket).
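As a rough sketch of how these options fit together, the helper below assembles a request body for bulbul:v3 without sending it. The parameters `enable_preprocessing`, `speaker`, `output_audio_codec`, and `dict_id` come from the text above. The `text` and `target_language_code` keys, the `hi-IN` language code, the function name, and the placeholder dictionary ID are assumptions for illustration.

```python
from typing import Optional


def build_tts_payload(
    text: str,
    language_code: str,
    speaker: str = "shubh",
    telephony: bool = False,
    preprocess: bool = False,
    dict_id: Optional[str] = None,
) -> dict:
    """Assemble a TTS request body; nothing is sent over the network."""
    payload = {
        "model": "bulbul:v3",
        "text": text,                              # assumed key name
        "target_language_code": language_code,     # assumed key name
        "speaker": speaker,
        "enable_preprocessing": preprocess,
        # mulaw suits phone lines; mp3 is used here as a general default.
        "output_audio_codec": "mulaw" if telephony else "mp3",
    }
    if dict_id is not None:
        # ID of a Pronunciation Dictionary created beforehand.
        payload["dict_id"] = dict_id
    return payload


# Example: an IVR prompt with banking abbreviations and a pinned pronunciation.
ivr_payload = build_tts_payload(
    "आपका KYC पूरा हो गया है",
    language_code="hi-IN",
    speaker="priya",
    telephony=True,
    preprocess=True,
    dict_id="your-dict-id",  # placeholder, not a real ID
)
```

Leaving `dict_id` out of the payload when none is supplied keeps the default pronunciation path unchanged for callers that do not use a dictionary.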
Indian documents arrive as scans, photos, and PDFs in regional scripts — often with tables and mixed languages. Sarvam Vision extracts text and structure across 23 languages (22 Indian + English), preserving the original script, and returns results as html, md, or json.
It’s an asynchronous job API: initialise a job, upload your files to the returned URLs, start it, poll status, then download the extracted output.
See the Document Digitization guide for the complete upload → start → poll → download flow.
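The poll step of that job flow can be sketched as a generic loop that is independent of any SDK. Here `get_status` is a hypothetical callable you would wire to the job-status call, and the state names `Completed` and `Failed` are assumptions; substitute the values the API actually returns.

```python
import time
from typing import Callable, Iterable


def poll_until_done(
    get_status: Callable[[], str],
    done_states: Iterable[str] = ("Completed",),   # assumed state name
    failed_states: Iterable[str] = ("Failed",),    # assumed state name
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until the job finishes, fails, or times out."""
    done = set(done_states)
    failed = set(failed_states)
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"job ended in state {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting. In production code you would leave them at their defaults and download the output once the function returns.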
For text (not speech), mayura:v1 (12 languages, colloquial modes, output-script and native-numeral control) and sarvam-translate:v1 (all 22 official Indian languages, formal) handle translation, and the Transliteration API converts between scripts. See the Translation guide for output_script and numerals_format options.