Building for Indian Languages
The concepts that make Indian-language speech and document apps work in production — code-mixing, scripts, telephony audio, voices, and Indic OCR.
Indian-language products have to handle realities that English-first stacks ignore: people mix English into every sentence, the same language is written in multiple scripts, a lot of audio arrives over 8kHz phone lines, and documents come as scans in regional scripts. This page maps those realities to the exact Sarvam speech, voice, and vision APIs — with code that’s been run against the live API.
Language coverage at a glance
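Coverage differs by capability, so pick the model that matches your language set. On this page, bulbul:v3 covers 11 languages for speech synthesis, Sarvam Vision covers 23 for OCR, and for translation mayura:v1 covers 12 while sarvam-translate:v1 covers all 22 official Indian languages.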
Speech-to-Text: transcribe real Indian audio
saaras:v3 exposes a mode parameter that controls how speech is written down. This is where most Indic-specific decisions happen.
The difference is easiest to see on one clip: for audio of “मुझे flight book करनी है” (a typical Hinglish sentence), each mode writes the same audio down differently.
Use codemix for chat/agent transcripts that feel natural, transcribe for clean native-script records, and translit when a downstream system only handles Latin script. The REST endpoint handles clips up to 30 seconds — for longer audio use the Batch API.
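The guidance above can be sketched as a small planning step that runs before any request is sent. This is a minimal sketch, not the SDK: the mode values and the 30-second REST limit come from the text above, while the use-case labels, the `plan_transcription` helper, and the returned dictionary shape are hypothetical names chosen for illustration. It assumes the clip is a WAV file so the duration can be read with the standard library.

```python
import io
import wave

REST_MAX_SECONDS = 30.0  # REST limit stated in the text above

# Mode values come from the text; the use-case keys are hypothetical labels.
MODE_BY_USE_CASE = {
    "chat_transcript": "codemix",
    "native_script_record": "transcribe",
    "latin_only_downstream": "translit",
}


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the duration of an in-memory WAV clip from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def plan_transcription(wav_bytes: bytes, use_case: str) -> dict:
    """Pick a mode for the use case and route long clips to the Batch API."""
    duration = wav_duration_seconds(wav_bytes)
    return {
        "model": "saaras:v3",
        "mode": MODE_BY_USE_CASE[use_case],
        "route": "rest" if duration <= REST_MAX_SECONDS else "batch",
        "duration_seconds": duration,
    }


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> bytes:
    """Build a mono 16-bit silent WAV clip, handy for trying the planner."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()
```

Calling `plan_transcription(make_silent_wav(5), "chat_transcript")` selects `codemix` over REST, while a 31-second clip is routed to `batch`. Keeping the decision in one place makes it harder to send an over-length clip to the REST endpoint by accident.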
Telephony and 8kHz audio
A large share of Indian voice traffic is phone audio: 8kHz, mono, often µ-law/A-law encoded. Two rules keep transcription quality high:
- Match the sample rate everywhere. For 8kHz audio, set sample_rate=8000 both when opening the streaming connection and when sending each audio chunk. Mismatched rates cause garbled output.
- Use a supported streaming codec. Streaming STT accepts WAV and raw PCM (pcm_s16le, pcm_l16, pcm_raw) only.
saaras:v3 is tuned for telephony, so prefer it for call audio.
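Since streaming accepts only WAV and raw PCM, µ-law call audio presumably has to be decoded before it is sent. The sketch below shows one way to do that with the standard G.711 µ-law expansion and then cut the result into chunks that all carry the same sample rate. The chunk dictionary shape (`sample_rate`, `encoding`, `audio`) is a hypothetical stand-in for whatever the streaming client expects, not the actual message format.

```python
import struct

SAMPLE_RATE = 8000  # one constant, reused for the connection and every chunk


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_to_pcm_s16le(ulaw: bytes) -> bytes:
    """Decode a mu-law buffer to little-endian 16-bit PCM (pcm_s16le)."""
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


def pcm_chunks(pcm: bytes, chunk_ms: int = 100):
    """Yield fixed-duration chunks, each tagged with the shared sample rate.

    The dictionary keys are hypothetical; adapt them to the real client.
    """
    bytes_per_chunk = SAMPLE_RATE * 2 * chunk_ms // 1000  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield {
            "sample_rate": SAMPLE_RATE,
            "encoding": "pcm_s16le",
            "audio": pcm[start:start + bytes_per_chunk],
        }
```

One second of µ-law audio (8000 bytes) decodes to 16000 bytes of PCM, which `pcm_chunks` splits into ten 100 ms chunks. Reading the rate from a single constant is a simple way to avoid the mismatch described above.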
See the Streaming STT guide for the full WebSocket lifecycle, VAD, reconnection, and voice-agent barge-in.
Text-to-Speech: natural Indian voices
bulbul:v3 speaks 11 languages (10 Indian + English) with 30+ voices, and it’s built for the way Indians actually write — so you usually don’t pre-process anything.
- Pass code-mixed text directly. Mixed Hindi-English (“आपका OTP 4321 है। Please use it…”) is spoken naturally; no need to romanize or split it.
- Normalize English words and numbers with enable_preprocessing=true when your text has lots of abbreviations or digits.
- Pick a voice with speaker (e.g. shubh, priya, kavya); see the voice list.
- Output to a phone line by requesting a telephony codec: output_audio_codec supports mulaw and alaw alongside mp3, wav, opus, etc.
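The options in this list can be combined into one request body. The following is a minimal sketch that only builds the payload and sends nothing. The parameters `speaker`, `enable_preprocessing`, and `output_audio_codec` and the model name come from the text above; the `text` and `target_language_code` field names, the `hi-IN` code, and the helper name are assumptions to check against the API reference.

```python
import json


def build_tts_payload(
    text: str,
    speaker: str = "shubh",
    codec: str = "mp3",
    normalize: bool = False,
    language_code: str = "hi-IN",  # assumed code format
) -> dict:
    """Assemble a bulbul:v3 request body; code-mixed text is passed unchanged."""
    payload = {
        "model": "bulbul:v3",
        "text": text,                           # assumed field name
        "target_language_code": language_code,  # assumed field name
        "speaker": speaker,
        "output_audio_codec": codec,
    }
    if normalize:
        # Worth enabling when the text is heavy on digits or abbreviations.
        payload["enable_preprocessing"] = True
    return payload


# An OTP prompt destined for a phone line: mu-law output, normalization on.
otp_payload = build_tts_payload(
    "आपका OTP 4321 है।", speaker="priya", codec="mulaw", normalize=True
)
otp_json = json.dumps(otp_payload, ensure_ascii=False)
```

`ensure_ascii=False` keeps the Devanagari text readable in logs; the default escaped form encodes the same characters.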
Pronunciation control
Brand names, abbreviations, and regional terms (“NEFT”, “KYC”, a company name) don’t always come out right by default. Create a Pronunciation Dictionary with Bulbul v3 to pin how specific words are spoken, then pass its dict_id to any TTS call (REST, HTTP stream, or WebSocket).
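A minimal sketch of attaching a dictionary to a request, assuming the TTS call takes a JSON-style body. Only the `dict_id` parameter name comes from the text above; the `with_pronunciation` helper, the other payload fields, and the placeholder ID are hypothetical.

```python
def with_pronunciation(payload: dict, dict_id: str) -> dict:
    """Return a copy of a TTS payload that references a pronunciation dictionary."""
    if not dict_id:
        raise ValueError("dict_id must be a non-empty string")
    updated = dict(payload)  # copy so the base payload can be reused as-is
    updated["dict_id"] = dict_id
    return updated


# Hypothetical base payload and placeholder dictionary ID.
base_payload = {"model": "bulbul:v3", "text": "NEFT और KYC पूरा करें।"}
pinned_payload = with_pronunciation(base_payload, "your-dict-id")
```

Because the helper copies the payload, the same base request can be sent with or without the dictionary, which makes it easy to compare the two pronunciations.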
Document Digitization: Indic OCR with Sarvam Vision
Indian documents arrive as scans, photos, and PDFs in regional scripts — often with tables and mixed languages. Sarvam Vision extracts text and structure across 23 languages (22 Indian + English), preserving the original script, and returns results as html, md, or json.
It’s an asynchronous job API: initialise a job, upload your files to the returned URLs, start it, poll status, then download the extracted output.
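The five steps can be sketched as one function over a small client interface. Everything about the client here is hypothetical: the `DocJobClient` method names, the `job_id` and `upload_urls` keys, and the `completed` and `failed` state strings are placeholders for whatever the real SDK or REST responses provide. Only the step order and the html, md, and json output formats come from the text above.

```python
import time
from typing import Callable, Protocol

OUTPUT_FORMATS = {"html", "md", "json"}  # formats named in the text above


class DocJobClient(Protocol):
    """Hypothetical interface; map each method onto the real API calls."""

    def initialise(self, filenames: list[str], output_format: str) -> dict: ...
    def upload(self, url: str, path: str) -> None: ...
    def start(self, job_id: str) -> None: ...
    def status(self, job_id: str) -> str: ...
    def download(self, job_id: str) -> dict[str, str]: ...


def digitize(
    client: DocJobClient,
    paths: list[str],
    output_format: str = "md",
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Run initialise -> upload -> start -> poll -> download for one job."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")

    job = client.initialise(paths, output_format)
    for path in paths:
        client.upload(job["upload_urls"][path], path)  # assumed response key
    client.start(job["job_id"])                        # assumed response key

    waited = 0.0
    while True:
        state = client.status(job["job_id"])
        if state == "completed":                       # assumed state name
            return client.download(job["job_id"])
        if state == "failed":                          # assumed state name
            raise RuntimeError(f"job {job['job_id']} failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job['job_id']} still {state}")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter lets the loop be exercised against a fake client without real waiting, and the timeout keeps a stuck job from blocking a worker indefinitely.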
See the Document Digitization guide for the complete upload → start → poll → download flow.
Translation and transliteration
For text (not speech), mayura:v1 (12 languages, colloquial modes, output-script and native-numeral control) and sarvam-translate:v1 (all 22 official Indian languages, formal) handle translation, and the Transliteration API converts between scripts. See the Translation guide for output_script and numerals_format options.
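One way to read the split between the two translation models is as a simple decision rule. This sketch treats the parenthetical features above as the deciding factors, which is an interpretation rather than documented guidance. The `pick_translation_model` helper is hypothetical, and whether a language falls within mayura:v1's 12 is left to the caller because the list is not given here.

```python
def pick_translation_model(
    needs_colloquial: bool,
    needs_script_or_numeral_control: bool,
    mayura_supports_language: bool,
) -> str:
    """Choose a translation model from the features described above."""
    wants_mayura = needs_colloquial or needs_script_or_numeral_control
    if wants_mayura and not mayura_supports_language:
        # Requested features are only described for mayura:v1.
        raise ValueError(
            "colloquial or script/numeral options need a mayura:v1 language"
        )
    return "mayura:v1" if wants_mayura else "sarvam-translate:v1"
```

For a formal translation with no script or numeral requirements, the rule falls back to sarvam-translate:v1, which covers the wider language set.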