Saaras

Saaras v3 is our state-of-the-art speech recognition model with flexible output formats. It supports multiple output modes including transcription, translation, verbatim, transliteration, and code-mixed outputs. Saaras is built to make Indic languages LLM-comprehensible, offering accurate transcriptions and translations across 23 languages (22 Indian languages + English).

Saaras v3 is the latest version with improved accuracy and performance. It is available in the Speech-to-Text endpoint (/speech-to-text) and supports multiple output modes via the mode parameter.

Output Modes

Saaras v3 supports multiple output modes via the mode parameter. Each mode produces different output formats for the same input audio.

Example audio: “मेरा फोन नंबर है 9840950950”

| Mode | Description | Example Output |
| --- | --- | --- |
| `transcribe` (default) | Standard transcription in the original language with proper formatting and number normalization | मेरा फोन नंबर है 9840950950 |
| `translate` | Translates speech from any supported Indic language to English | My phone number is 9840950950 |
| `verbatim` | Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is | मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero |
| `translit` | Transliterates speech to Latin (Roman) script | mera phone number hai 9840950950 |
| `codemix` | Code-mixed text with English words in English and Indic words in native script | मेरा phone number है 9840950950 |
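Since only the mode parameter changes between requests, all five outputs for one file can be collected in a single loop. A minimal sketch, assuming the SarvamAI Python SDK shown in the usage examples on this page; the `.transcript` attribute on the response object is an assumption here:

```python
# Sketch: fetch all five Saaras v3 output modes for one audio file.
MODES = ["transcribe", "translate", "verbatim", "translit", "codemix"]

def transcribe_all_modes(audio_path: str, api_key: str) -> dict:
    """Return a {mode: transcript} mapping for one audio file."""
    from sarvamai import SarvamAI  # imported lazily; requires the sarvamai package

    client = SarvamAI(api_subscription_key=api_key)
    results = {}
    for mode in MODES:
        # Re-open the file per request so each upload starts at byte 0.
        with open(audio_path, "rb") as f:
            response = client.speech_to_text.transcribe(
                file=f, model="saaras:v3", mode=mode
            )
        results[mode] = response.transcript
    return results
```

Running this against the example audio above would yield the five outputs listed in the table, one per mode.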

Key Features

Domain-Aware Translation

Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.

Superior Telephony Performance

Optimized for 8KHz telephony audio with enhanced multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning.

Multi-Language Support

Supports 23 languages (22 Indian + English) with optional language identification.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.

Direct Translation

Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

Language Support

Saaras v3 supports 23 languages (22 Indian languages + English) with comprehensive dialect and accent coverage, including code-mixed audio support and intelligent proper noun preservation for speech-to-English translation.

| Language | Language Code | Language | Language Code |
| --- | --- | --- | --- |
| Hindi | hi-IN | Assamese | as-IN |
| Bengali | bn-IN | Urdu | ur-IN |
| Kannada | kn-IN | Nepali | ne-IN |
| Malayalam | ml-IN | Konkani | kok-IN |
| Marathi | mr-IN | Kashmiri | ks-IN |
| Odia | od-IN | Sindhi | sd-IN |
| Punjabi | pa-IN | Sanskrit | sa-IN |
| Tamil | ta-IN | Santali | sat-IN |
| Telugu | te-IN | Manipuri | mni-IN |
| English | en-IN | Bodo | brx-IN |
| Gujarati | gu-IN | Maithili | mai-IN |
| Dogri | doi-IN | | |
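For client-side validation before a request, the table above can be captured as a plain mapping (transcribed here for convenience; the API itself remains the source of truth):

```python
# BCP-47 codes supported by Saaras v3, transcribed from the table above.
SAARAS_V3_LANGUAGES = {
    "hi-IN": "Hindi",     "as-IN": "Assamese",
    "bn-IN": "Bengali",   "ur-IN": "Urdu",
    "kn-IN": "Kannada",   "ne-IN": "Nepali",
    "ml-IN": "Malayalam", "kok-IN": "Konkani",
    "mr-IN": "Marathi",   "ks-IN": "Kashmiri",
    "od-IN": "Odia",      "sd-IN": "Sindhi",
    "pa-IN": "Punjabi",   "sa-IN": "Sanskrit",
    "ta-IN": "Tamil",     "sat-IN": "Santali",
    "te-IN": "Telugu",    "mni-IN": "Manipuri",
    "en-IN": "English",   "brx-IN": "Bodo",
    "gu-IN": "Gujarati",  "mai-IN": "Maithili",
    "doi-IN": "Dogri",
}

def check_language_code(code: str) -> str:
    """Fail fast, before an API round trip, on an unsupported code."""
    if code != "unknown" and code not in SAARAS_V3_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code
```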

Language codes are optional. When not specified or set to unknown, the model will automatically detect the input language and return a language_probability score indicating detection confidence.
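When relying on auto-detection, downstream logic can be gated on the returned confidence. A minimal sketch; the 0.8 threshold is an illustrative choice, not an API recommendation:

```python
def detected_language(response: dict, min_probability: float = 0.8):
    """Return the detected language code, or None when detection was
    skipped (an explicit language_code was sent) or confidence is low."""
    code = response.get("language_code")
    prob = response.get("language_probability")
    if code is None or prob is None:  # no language detected / detection skipped
        return None
    return code if prob >= min_probability else None

# e.g. with the detection fields from an API response:
print(detected_language({"language_code": "hi-IN", "language_probability": 0.95}))  # prints "hi-IN"
```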

Additional Capabilities:

  • Includes dialects and accents of the above languages
  • Code-mixed audio support
  • Intelligent Proper Noun and Entity Preservation to ensure proper nouns, regional names, and entities are recognized and retained accurately during transcription

API Response Format

The Speech-to-Text API returns a JSON response with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| request_id | string | Unique identifier for the API request. |
| transcript | string | The transcribed text from the provided audio file. |
| timestamps | object or null | Word-level timestamps (start_time_seconds, end_time_seconds, words). Included only when with_timestamps is set to true. |
| diarized_transcript | object or null | Diarized transcript with speaker labels. Available through the batch API. |
| language_code | string or null | BCP-47 code of the detected language (e.g., hi-IN). Returns the most predominant language if multiple are detected; null if no language is detected. |
| language_probability | number or null | Confidence of language detection, from 0.0 to 1.0 (higher is more confident). Populated when language_code is omitted or set to unknown; null when a specific language_code is provided, since detection is skipped. The field itself is always present in the response. |

Example Response:

```json
{
  "request_id": "20260209_abc123-def4-5678-ghij-klmnopqrstuv",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "timestamps": null,
  "diarized_transcript": null,
  "language_code": "hi-IN",
  "language_probability": 0.95
}
```
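The timestamps object (returned when with_timestamps is set to true on the request) can be consumed as follows. This sketch assumes the three keys listed in the table hold parallel lists, with words[i] spanning start_time_seconds[i] to end_time_seconds[i] — a layout assumption worth confirming against a real response:

```python
def words_with_times(timestamps):
    """Pair each word with its (start, end) time in seconds.

    Assumes parallel lists under 'words', 'start_time_seconds',
    and 'end_time_seconds' (layout assumption, see note above).
    """
    if timestamps is None:  # with_timestamps was false or omitted
        return []
    return list(zip(
        timestamps["words"],
        timestamps["start_time_seconds"],
        timestamps["end_time_seconds"],
    ))

sample = {
    "words": ["नमस्ते", "आप"],
    "start_time_seconds": [0.0, 0.6],
    "end_time_seconds": [0.5, 0.9],
}
pairs = words_with_times(sample)  # [("नमस्ते", 0.0, 0.5), ("आप", 0.6, 0.9)]
```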

Usage Example

The default transcribe mode produces standard transcription in the original language with proper formatting and number normalization.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

response = client.speech_to_text.transcribe(
    file=open("audio.wav", "rb"),
    model="saaras:v3",
    mode="transcribe"  # default mode
)

print(response)
# Output: मेरा फोन नंबर है 9840950950
```

Deprecation Notice: Saaras v2.5 will be deprecated soon. We recommend migrating to Saaras v3 for improved accuracy and performance. The v2.5 model will continue to work during the transition period, but new features and improvements will only be available in v3.

About Saaras v2.5

Saaras v2.5 is the previous speech translation model available in the Speech-to-Text Translate endpoint (/speech-to-text-translate). It converts speech directly to English text with enhanced telephony support and intelligent entity preservation.

Key Difference: Saaras v2.5 uses the /speech-to-text-translate endpoint, while Saaras v3 uses the /speech-to-text endpoint with mode parameter support.

Key Features (v2.5)

Domain-Aware Translation

Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.

Superior Telephony Performance

Optimized for 8KHz telephony audio with enhanced multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning.

Multi-Language Support

Supports 11 Indian languages with optional language identification.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.

Direct Translation

Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

Translation Quality (v2.5 Benchmarks)

The COMET score, a robust metric for evaluating machine translation of speech, assesses semantic accuracy, fluency, and contextual relevance. Saaras v2.5 achieves exceptional performance on the Vistaar + IndicVoices benchmark, a dataset curated from diverse Indian-language audio sources, including code-mixed content, noisy environments, and regional accents.

COMET Score Performance:

  • Across 11 Languages: 89.3%
  • English: 94.62%
  • Hindi: 91.83%
  • 9 Other languages: 88.41%

Higher is better; evaluated on the Vistaar + IndicVoices benchmark.

Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.

Dataset Description: Contains real-world, multi-accented speech samples covering 10 major Indic languages, ensuring representation of India’s linguistic diversity, and includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.

Per-language COMET scores (from the benchmark chart):

| Language | COMET Score |
| --- | --- |
| Bengali | 88.06 |
| English | 94.62 |
| Gujarati | 90.33 |
| Hindi | 91.83 |
| Kannada | 88.30 |
| Malayalam | 89.28 |
| Marathi | 89.07 |
| Odia | 89.77 |
| Punjabi | 86.39 |
| Tamil | 86.45 |
| Telugu | 88.06 |

v2.5 Usage Example

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Using deprecated v2.5 model
response = client.speech_to_text.translate(
    file=open("audio.wav", "rb"),
    model="saaras:v2.5"  # Deprecated - migrate to saaras:v3
)

print(response)
```

Migration Guide

To migrate from Saaras v2.5 to v3:

  1. Change the endpoint: Switch from /speech-to-text-translate to /speech-to-text
  2. Update the model parameter: Change from saaras:v2.5 to saaras:v3
  3. Add the mode parameter: Use mode="translate" to get English output (similar to v2.5 behavior)
```diff
# Endpoint change
- POST /speech-to-text-translate
+ POST /speech-to-text

# Parameter changes
- model="saaras:v2.5"
+ model="saaras:v3"
+ mode="translate"
```

SDK Migration:

```diff
# Python
- response = client.speech_to_text.translate(file, model="saaras:v2.5")
+ response = client.speech_to_text.transcribe(file, model="saaras:v3", mode="translate")

# JavaScript
- const response = await client.speechToText.translate(file, { model: "saaras:v2.5" });
+ const response = await client.speechToText.transcribe(file, { model: "saaras:v3", mode: "translate" });
```

The response format remains compatible.
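During migration, a thin wrapper can keep the old "speech in, English out" call sites unchanged while routing to v3. A sketch assuming the SDK calls shown above, where `client` is a SarvamAI instance:

```python
def translate_speech(client, audio_path: str):
    """v2.5-style direct speech-to-English via the v3 endpoint.

    Drop-in replacement for:
        client.speech_to_text.translate(file, model="saaras:v2.5")
    """
    with open(audio_path, "rb") as f:
        return client.speech_to_text.transcribe(
            file=f, model="saaras:v3", mode="translate"
        )
```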