Building for Indian Languages

Indian-language products have to handle realities that English-first stacks ignore: people routinely mix English into their sentences, the same language is written in multiple scripts, much of the audio arrives over 8kHz phone lines, and documents come as scans in regional scripts. This page maps those realities to the relevant Sarvam speech, voice, and vision APIs, with code that has been run against the live API.

Language coverage at a glance

Coverage differs by capability, so pick the model that matches your language set.

Capability	Model	Languages
Speech-to-Text	Saarika v2.5	11 (10 Indian + English)
Speech-to-Text	Saaras v3	23 (22 Indian + English)
Speech-to-Text-Translate	Saaras v3	23 (22 Indian + English) → English
Text-to-Speech	Bulbul v3	11 (10 Indian + English)
Document Digitization	Sarvam Vision	23 (22 Indian + English)
Translation	Mayura v1	11 (10 Indian + English)
Translation	Sarvam-Translate v1	23 (22 Indian + English)
Chat / reasoning	Sarvam-30B / 105B	11 (10 Indian + English)

Speech-to-Text: transcribe real Indian audio

saaras:v3 exposes a mode parameter that controls how speech is written down — this is where most Indic-specific decisions happen:

`mode`	What you get
`transcribe`	Transcription in the native script
`translate`	English translation of the speech
`verbatim`	Word-for-word, including fillers and repetitions
`translit`	Romanized (Latin-script) output
`codemix`	Natural code-mixed text — English words stay in English

The difference is easiest to see on one clip. For audio of “मुझे flight book करनी है” (a typical Hinglish sentence), the same audio produces:

`mode`	Transcript (live output)
`transcribe`	मुझे फ्लाइट बुक करनी है।
`codemix`	मुझे flight बुक करनी है।
`translit`	Mujhe flight book karni hai.

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("audio.wav", "rb") as f:
    response = client.speech_to_text.transcribe(
        file=f,
        model="saaras:v3",
        mode="codemix",          # keep English words in English
        language_code="hi-IN",
    )
print(response.transcript)       # मुझे flight बुक करनी है।

Use codemix for chat/agent transcripts that feel natural, transcribe for clean native-script records, and translit when a downstream system only handles Latin script. The REST endpoint handles clips up to 30 seconds — for longer audio use the Batch API.
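
One way to act on the 30-second limit is to check a clip's duration before choosing an endpoint. The sketch below uses only Python's standard `wave` module and assumes uncompressed WAV input; the helper names `wav_duration_seconds` and `needs_batch` are hypothetical and not part of the Sarvam SDK.

```python
import wave

REST_MAX_SECONDS = 30.0  # REST clip limit stated above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def needs_batch(path: str, limit: float = REST_MAX_SECONDS) -> bool:
    """True when the clip is too long for the REST endpoint."""
    return wav_duration_seconds(path) > limit
```

A caller could send the file to the REST endpoint when `needs_batch(path)` is false and hand it to the Batch API otherwise.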

Telephony and 8kHz audio

A large share of Indian voice traffic is phone audio: 8kHz, mono, often µ-law/A-law encoded. Two rules keep transcription quality high:

Match the sample rate everywhere. For 8kHz audio, set sample_rate=8000 both when opening the streaming connection and when sending each audio chunk. Mismatched rates cause garbled output.
Use a supported streaming codec. Streaming STT accepts WAV and raw PCM (pcm_s16le, pcm_l16, pcm_raw) only.

saaras:v3 is tuned for telephony, so prefer it for call audio.

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load 8kHz call audio and base64-encode it
with open("call_recording_8khz.wav", "rb") as f:
    audio_chunk = base64.b64encode(f.read()).decode("utf-8")

async def transcribe_call_audio():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")
    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",
        language_code="hi-IN",
        sample_rate=8000,              # match the phone line
        input_audio_codec="pcm_s16le",
    ) as ws:
        await ws.transcribe(audio=audio_chunk, encoding="audio/wav", sample_rate=8000)
        print(await ws.recv())

asyncio.run(transcribe_call_audio())

See the Streaming STT guide for the full WebSocket lifecycle, VAD, reconnection, and voice-agent barge-in.
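
Because streaming STT accepts only WAV and raw PCM, µ-law call audio has to be decoded before it is sent. The following is a minimal sketch of standard G.711 µ-law decoding in pure Python, wrapping the result in an 8kHz mono WAV and base64-encoding it. The function names are hypothetical helpers, not SDK calls, and passing the resulting string as the `audio` argument of the streaming example above is an assumption to verify against the Streaming STT guide. A-law audio needs a different decoding rule and is not covered here.

```python
import base64
import io
import struct
import wave


def ulaw_to_pcm16(sample: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit PCM sample."""
    sample = ~sample & 0xFF                     # mu-law bytes are stored bit-inverted
    magnitude = ((sample & 0x0F) << 3) + 0x84   # 4-bit mantissa plus bias
    magnitude <<= (sample & 0x70) >> 4          # apply the 3-bit exponent
    return 0x84 - magnitude if sample & 0x80 else magnitude - 0x84


def ulaw_bytes_to_wav_b64(ulaw: bytes, sample_rate: int = 8000) -> str:
    """Decode mu-law bytes, wrap them in a mono 16-bit WAV, and base64-encode it."""
    pcm = struct.pack(f"<{len(ulaw)}h", *(ulaw_to_pcm16(b) for b in ulaw))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)             # phone audio is mono
        wav.setsampwidth(2)             # 16-bit samples
        wav.setframerate(sample_rate)   # keep the original 8kHz rate
        wav.writeframes(pcm)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

Keeping the WAV at the source rate, rather than resampling, is what lets `sample_rate=8000` stay consistent across the connection and every chunk.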

Text-to-Speech: natural Indian voices

bulbul:v3 speaks 11 languages (10 Indian + English) with 30+ voices, and it’s built for the way Indians actually write — so you usually don’t pre-process anything.

Pass code-mixed text directly. Mixed Hindi-English (“आपका OTP 4321 है। Please use it…”) is spoken naturally; no need to romanize or split it.
Normalize English words and numbers with enable_preprocessing=true when your text has lots of abbreviations or digits.
Pick a voice with speaker (e.g. shubh, priya, kavya) — see the voice list.
Output to a phone line by requesting a telephony codec — output_audio_codec supports mulaw and alaw alongside mp3, wav, opus, etc.

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("otp.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="आपका OTP 4321 है। Please use it within 5 minutes.",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
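
The preprocessing and telephony-codec options can be decided per request. The sketch below is illustrative only: `should_preprocess` is a hypothetical heuristic (its regex and `min_hits` threshold are assumptions, not Sarvam guidance), and `tts_kwargs` merely assembles the parameters named in this section so they can be unpacked into a TTS call.

```python
import re

# Digit runs or all-caps abbreviations such as "OTP" or "KYC".
_NEEDS_NORMALIZING = re.compile(r"\d+|\b[A-Z]{2,}\b")


def should_preprocess(text: str, min_hits: int = 2) -> bool:
    """Hypothetical heuristic: True when the text has several digit runs or abbreviations."""
    return len(_NEEDS_NORMALIZING.findall(text)) >= min_hits


def tts_kwargs(text: str, language: str, speaker: str, for_phone: bool = False) -> dict:
    """Assemble keyword arguments for a bulbul:v3 request."""
    return {
        "text": text,
        "target_language_code": language,
        "speaker": speaker,
        "model": "bulbul:v3",
        "enable_preprocessing": should_preprocess(text),
        # mulaw for phone lines; alaw is the other telephony codec named above
        "output_audio_codec": "mulaw" if for_phone else "mp3",
    }
```

With a client in scope, the result could be passed as `client.text_to_speech.convert_stream(**tts_kwargs(...))`.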

Pronunciation control

Brand names, abbreviations, and regional terms (“NEFT”, “KYC”, a company name) don’t always come out right by default. Create a Pronunciation Dictionary with Bulbul v3 to pin how specific words are spoken, then pass its dict_id to any TTS call (REST, HTTP stream, or WebSocket):

# Reuses the client created in the previous example
for chunk in client.text_to_speech.convert_stream(
    text="NEFT transfer karein aur KYC complete karein",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    dict_id="p_5cb7faa6",        # your pronunciation dictionary
    output_audio_codec="mp3",
):
    ...  # write/play chunk

Document Digitization: Indic OCR with Sarvam Vision

Indian documents arrive as scans, photos, and PDFs in regional scripts — often with tables and mixed languages. Sarvam Vision extracts text and structure across 23 languages (22 Indian + English), preserving the original script, and returns results as html, md, or json.

It’s an asynchronous job API: initialize a job, upload your files to the returned URLs, start the job, poll its status, then download the extracted output.

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

job = client.document_intelligence.create_job(
    language="hi-IN",     # primary document language (BCP-47)
    output_format="md",   # "md", "html", or "json"
)
job.upload_file("document.pdf")
job.start()
job.wait_until_complete()
job.download_output(output_path="./output")

See the Document Digitization guide for the complete upload → start → poll → download flow.
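
If you drive the job over REST rather than through `wait_until_complete()`, the poll step might look like the generic loop below. It is a sketch under stated assumptions: `get_status` is a hypothetical callable you supply to fetch the job state, and the terminal state names `"Completed"` and `"Failed"` are placeholders to replace with the values the API actually returns.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    done_states: frozenset = frozenset({"Completed", "Failed"}),  # assumed names
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with exponential backoff until a terminal state or timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)  # back off, capped
```

Doubling the interval up to a cap keeps short jobs responsive while avoiding a tight polling loop on large multi-page documents.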

Translation and transliteration

For text (not speech), two models handle translation. mayura:v1 covers 11 languages (10 Indian + English) and offers colloquial modes plus output-script and native-numeral control; sarvam-translate:v1 covers 23 languages (22 Indian + English) in a formal register. The Transliteration API converts between scripts. See the Translation guide for the output_script and numerals_format options.
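
Choosing between the two translation models can be reduced to a coverage check. The sketch below is one possible policy, not an official rule: prefer mayura:v1 when both languages fall inside its set, otherwise fall back to sarvam-translate:v1. The language codes in `MAYURA_LANGUAGES` are an assumption and should be verified against the Translation guide; `pick_translation_model` is a hypothetical helper.

```python
# Assumed mayura:v1 language codes (10 Indian + English); verify against the docs.
MAYURA_LANGUAGES = frozenset({
    "en-IN", "hi-IN", "bn-IN", "gu-IN", "kn-IN", "ml-IN",
    "mr-IN", "od-IN", "pa-IN", "ta-IN", "te-IN",
})


def pick_translation_model(source: str, target: str, colloquial: bool = False) -> str:
    """Return a model name for the language pair, preferring mayura:v1 when it covers both."""
    if source in MAYURA_LANGUAGES and target in MAYURA_LANGUAGES:
        return "mayura:v1"
    if colloquial:
        # Only mayura:v1 is described as having colloquial modes.
        raise ValueError(f"colloquial output is not available for {source} -> {target}")
    return "sarvam-translate:v1"
```

Raising on an unsupported colloquial request, rather than silently returning formal output, makes the coverage gap visible to the caller.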