Building for Indian Languages

The concepts that make Indian-language speech and document apps work in production — code-mixing, scripts, telephony audio, voices, and Indic OCR.

Indian-language products have to handle realities that English-first stacks ignore: people mix English into every sentence, the same language is written in multiple scripts, a lot of audio arrives over 8kHz phone lines, and documents come as scans in regional scripts. This page maps those realities to the exact Sarvam speech, voice, and vision APIs — with code that’s been run against the live API.

Language coverage at a glance

Coverage differs by capability, so pick the model that matches your language set.

| Capability | Model | Languages |
| --- | --- | --- |
| Speech-to-Text | Saarika / Saaras v3 | 10+ Indian languages + English |
| Speech-to-Text-Translate | Saaras v3 | Indian languages → English |
| Text-to-Speech | Bulbul v3 | 11 (10 Indian + English) |
| Document Digitization | Sarvam Vision | 23 (22 Indian + English) |
| Translation | Mayura v1 / Sarvam-Translate v1 | 12 / 22 |
| Chat / reasoning | Sarvam-30B / 105B | Multilingual (Indian + English) |

Speech-to-Text: transcribe real Indian audio

saaras:v3 exposes a mode parameter that controls how speech is written down — this is where most Indic-specific decisions happen:

| mode | What you get |
| --- | --- |
| transcribe | Transcription in the native script |
| translate | English translation of the speech |
| verbatim | Word-for-word, including fillers and repetitions |
| translit | Romanized (Latin-script) output |
| codemix | Natural code-mixed text: English words stay in English |

The difference is easiest to see on one clip. For audio of “मुझे flight book करनी है” (a typical Hinglish sentence), the same audio produces:

| mode | Transcript (live output) |
| --- | --- |
| transcribe | मुझे फ्लाइट बुक करनी है। |
| codemix | मुझे flight बुक करनी है। |
| translit | Mujhe flight book karni hai. |

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("audio.wav", "rb") as f:
    response = client.speech_to_text.transcribe(
        file=f,
        model="saaras:v3",
        mode="codemix",  # keep English words in English
        language_code="hi-IN",
    )
print(response.transcript)  # मुझे flight बुक करनी है।
```

Use codemix for chat/agent transcripts that feel natural, transcribe for clean native-script records, and translit when a downstream system only handles Latin script. The REST endpoint handles clips up to 30 seconds — for longer audio use the Batch API.
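
The 30-second limit can be checked before a clip is uploaded. The following is a minimal sketch, assuming WAV input and using only Python's standard `wave` module; the helper names (`wav_duration_seconds`, `pick_endpoint`, `make_silent_wav`) and the `"rest"` / `"batch"` labels are illustrative and not part of the Sarvam SDK.

```python
import io
import wave

REST_LIMIT_SECONDS = 30.0  # clip limit stated above for the REST endpoint


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file (a path or a binary file object)."""
    with wave.open(wav_file, "rb") as w:
        return w.getnframes() / w.getframerate()


def pick_endpoint(wav_file) -> str:
    """Return "rest" for clips within the limit, otherwise "batch"."""
    if wav_duration_seconds(wav_file) <= REST_LIMIT_SECONDS:
        return "rest"
    return "batch"


def make_silent_wav(seconds: float, sample_rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf


print(pick_endpoint(make_silent_wav(5)))   # short clip
print(pick_endpoint(make_silent_wav(45)))  # long clip
```

In an application, the returned label would decide whether the file goes to the synchronous call shown above or to a Batch API job.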

Telephony and 8kHz audio

A large share of Indian voice traffic is phone audio: 8kHz, mono, often µ-law/A-law encoded. Two rules keep transcription quality high:

  1. Match the sample rate everywhere. For 8kHz audio, set sample_rate=8000 both when opening the streaming connection and when sending each audio chunk. Mismatched rates cause garbled output.
  2. Use a supported streaming codec. Streaming STT accepts WAV and raw PCM (pcm_s16le, pcm_l16, pcm_raw) only, so µ-law or A-law call audio needs to be decoded to linear PCM before it is sent.

saaras:v3 is tuned for telephony, so prefer it for call audio.
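
One way to follow the first rule is to read the sample rate from the WAV header once and reuse that single value for both the connection and every chunk. The sketch below does this with the standard library only; `wav_to_chunks` is a hypothetical helper, and the 100 ms chunk size is an assumption rather than a documented requirement. How raw chunks are labelled on the wire is left to the Streaming STT guide.

```python
import base64
import io
import wave


def wav_to_chunks(wav_file, expected_rate: int = 8000, chunk_ms: int = 100):
    """Check the WAV header, then return (sample_rate, base64-encoded PCM chunks)."""
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        if rate != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz audio, got {rate} Hz")
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        frames_per_chunk = rate * chunk_ms // 1000
        chunks = []
        while True:
            pcm = w.readframes(frames_per_chunk)
            if not pcm:
                break
            chunks.append(base64.b64encode(pcm).decode("utf-8"))
    return rate, chunks


# Demo: one second of 8 kHz mono silence built in memory.
demo = io.BytesIO()
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)
demo.seek(0)

rate, chunks = wav_to_chunks(demo)
print(rate, len(chunks))  # 8000 10
```

Passing the returned `rate` to both the connection and each send call keeps the two values from drifting apart, and a file recorded at a different rate fails early instead of producing garbled text.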

```python
import asyncio
import base64

from sarvamai import AsyncSarvamAI

# Load 8kHz call audio and base64-encode it
with open("call_recording_8khz.wav", "rb") as f:
    audio_chunk = base64.b64encode(f.read()).decode("utf-8")


async def transcribe_call_audio():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")
    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",
        language_code="hi-IN",
        sample_rate=8000,  # match the phone line
        input_audio_codec="pcm_s16le",
    ) as ws:
        await ws.transcribe(audio=audio_chunk, encoding="audio/wav", sample_rate=8000)
        print(await ws.recv())


asyncio.run(transcribe_call_audio())
```

See the Streaming STT guide for the full WebSocket lifecycle, VAD, reconnection, and voice-agent barge-in.
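
Because streaming STT takes only WAV and raw PCM, µ-law audio captured from a phone trunk has to be decoded first. The following is a minimal pure-Python sketch of the standard G.711 µ-law expansion to signed 16-bit little-endian samples (the layout that the `pcm_s16le` codec name suggests). The function names are illustrative, A-law uses a different expansion and is not covered, and a production pipeline would more likely use an audio library or the telephony platform's own transcoding.

```python
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 µ-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # µ-law bytes are transmitted bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_to_pcm_s16le(ulaw: bytes) -> bytes:
    """Decode a µ-law byte string to 16-bit little-endian PCM bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw]
    return struct.pack(f"<{len(samples)}h", *samples)


pcm = ulaw_to_pcm_s16le(bytes([0xFF, 0x00, 0x80]))
print(struct.unpack("<3h", pcm))  # (0, -32124, 32124)
```

The decoded bytes are twice as long as the input because each 8-bit µ-law sample becomes one 16-bit PCM sample; the sample rate stays at 8000.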

Text-to-Speech: natural Indian voices

bulbul:v3 speaks 11 languages (10 Indian + English) with 30+ voices, and it’s built for the way Indians actually write — so you usually don’t pre-process anything.

  • Pass code-mixed text directly. Mixed Hindi-English (“आपका OTP 4321 है। Please use it…”) is spoken naturally; no need to romanize or split it.
  • Normalize English words and numbers with enable_preprocessing=true when your text has lots of abbreviations or digits.
  • Pick a voice with speaker (e.g. shubh, priya, kavya) — see the voice list.
  • Output to a phone line by requesting a telephony codec — output_audio_codec supports mulaw and alaw alongside mp3, wav, opus, etc.

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("otp.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="आपका OTP 4321 है। Please use it within 5 minutes.",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
```
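
The options in the list above can be collected in one place before the call is made. The sketch below only builds a dictionary of arguments; `tts_kwargs` is a hypothetical helper, the thresholds for turning on `enable_preprocessing` are arbitrary assumptions, and it is assumed (not verified here) that `convert_stream` accepts `enable_preprocessing` as a keyword argument.

```python
import re


def tts_kwargs(text: str, language_code: str, speaker: str, for_phone: bool) -> dict:
    """Build keyword arguments for a Bulbul v3 request (illustrative heuristic)."""
    digits = sum(ch.isdigit() for ch in text)
    abbreviations = re.findall(r"\b[A-Z]{2,}\b", text)
    return {
        "text": text,  # code-mixed text is passed through unchanged
        "target_language_code": language_code,
        "speaker": speaker,
        "model": "bulbul:v3",
        # Telephony codec for phone lines, mp3 otherwise.
        "output_audio_codec": "mulaw" if for_phone else "mp3",
        # Assumed thresholds: enable normalization for digit- or abbreviation-heavy text.
        "enable_preprocessing": digits >= 4 or len(abbreviations) >= 2,
    }


kwargs = tts_kwargs(
    "आपका OTP 4321 है। Please use it within 5 minutes.",
    language_code="hi-IN",
    speaker="shubh",
    for_phone=True,
)
print(kwargs["output_audio_codec"], kwargs["enable_preprocessing"])
```

The resulting dictionary could then be unpacked into the streaming call, for example `client.text_to_speech.convert_stream(**kwargs)`, under the assumption stated above.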

Pronunciation control

Brand names, abbreviations, and regional terms (“NEFT”, “KYC”, a company name) don’t always come out right by default. Create a Pronunciation Dictionary with Bulbul v3 to pin how specific words are spoken, then pass its dict_id to any TTS call (REST, HTTP stream, or WebSocket):

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

for chunk in client.text_to_speech.convert_stream(
    text="NEFT transfer karein aur KYC complete karein",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    dict_id="p_5cb7faa6",  # your pronunciation dictionary
    output_audio_codec="mp3",
):
    ...  # write/play chunk
```

Document Digitization: Indic OCR with Sarvam Vision

Indian documents arrive as scans, photos, and PDFs in regional scripts — often with tables and mixed languages. Sarvam Vision extracts text and structure across 23 languages (22 Indian + English), preserving the original script, and returns results as html, md, or json.

It’s an asynchronous job API: initialise a job, upload your files to the returned URLs, start it, poll status, then download the extracted output.

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

job = client.document_intelligence.initialise(
    job_parameters={
        "language": "hi-IN",    # primary document language (BCP-47)
        "output_format": "md",  # "md", "html", or "json"
    },
)
# Then: upload files -> client.document_intelligence.start(...) ->
# poll get_status(...) -> get_download_links(...)
```

See the Document Digitization guide for the complete upload → start → poll → download flow.
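
The polling step of an asynchronous job like this one can be sketched independently of the SDK. In the example below, `poll_until_done` is a hypothetical helper that takes any callable returning the job's current state as a string; the state names (`"Completed"`, `"Failed"`), the interval, and the timeout are assumptions, so check the Document Digitization guide for the values the API actually returns.

```python
import time
from typing import Callable, Iterable


def poll_until_done(
    fetch_state: Callable[[], str],
    done_states: Iterable[str] = ("Completed",),  # assumed state name
    failed_states: Iterable[str] = ("Failed",),   # assumed state name
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_state() until it reports a terminal state or the timeout passes."""
    done, failed = set(done_states), set(failed_states)
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state in done:
            return state
        if state in failed:
            raise RuntimeError(f"job ended in state {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout_s} s")
        sleep(interval_s)


# Demo with a fake job that finishes on the third check.
fake_states = iter(["Pending", "Running", "Completed"])
final = poll_until_done(lambda: next(fake_states), sleep=lambda s: None)
print(final)  # Completed
```

In real use, `fetch_state` would wrap the SDK's status call for the job and return its state field; injecting it as a callable keeps the loop testable without network access.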

Translation and transliteration

For text (not speech), mayura:v1 (12 languages, colloquial modes, output-script and native-numeral control) and sarvam-translate:v1 (all 22 official Indian languages, formal) handle translation, and the Transliteration API converts between scripts. See the Translation guide for output_script and numerals_format options.

Where to go next

  • Speech-to-Text overview · Streaming STT · Batch STT
  • Text-to-Speech overview · Pronunciation Dictionary
  • Document Digitization
  • Libraries & SDKs · Errors & Troubleshooting