Saaras

Saaras v3 is our state-of-the-art speech recognition model with flexible output formats. It supports multiple output modes including transcription, translation, verbatim, transliteration, and code-mixed outputs. Saaras is built to make Indic languages LLM-comprehensible, offering accurate transcriptions and translations across 23 languages (22 Indian languages + English).

Saaras v3 is the latest version with improved accuracy and performance. It is available in the Speech-to-Text endpoint (/speech-to-text) and supports multiple output modes via the mode parameter.

Output Modes

Saaras v3 supports multiple output modes via the mode parameter. Each mode produces different output formats for the same input audio.

Example audio: “मेरा फोन नंबर है 9840950950”

| Mode | Description | Example Output |
| --- | --- | --- |
| `transcribe` (default) | Standard transcription in the original language with proper formatting and number normalization | मेरा फोन नंबर है 9840950950 |
| `translate` | Translates speech from any supported Indic language to English | My phone number is 9840950950 |
| `verbatim` | Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is | मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero |
| `translit` | Transliterates speech to Latin (Roman) script | mera phone number hai 9840950950 |
| `codemix` | Code-mixed text with English words in English and Indic words in native script | मेरा phone number है 9840950950 |
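Since only the mode parameter changes between requests, all five outputs for one file can be collected in a single loop. A minimal sketch, assuming the SarvamAI Python SDK shown in the usage examples on this page; the `.transcript` attribute on the response object is an assumption here:

```python
# Sketch: fetch all five Saaras v3 output modes for one audio file.
MODES = ["transcribe", "translate", "verbatim", "translit", "codemix"]

def transcribe_all_modes(audio_path: str, api_key: str) -> dict:
    """Return a {mode: transcript} mapping for one audio file."""
    from sarvamai import SarvamAI  # imported lazily; requires the sarvamai package

    client = SarvamAI(api_subscription_key=api_key)
    results = {}
    for mode in MODES:
        # Re-open the file per request so each upload starts at byte 0.
        with open(audio_path, "rb") as f:
            response = client.speech_to_text.transcribe(
                file=f, model="saaras:v3", mode=mode
            )
        results[mode] = response.transcript
    return results
```

Running this against the example audio above would yield the five outputs listed in the table, one per mode.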

Key Features

Domain-Aware Translation

Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.

Superior Telephony Performance

Optimized for 8KHz telephony audio with enhanced multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning.

Multi-Language Support

Supports 23 languages (22 Indian + English) with optional language identification.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.

Direct Translation

Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

Language Support

Saaras v3 supports 23 languages (22 Indian languages + English) with comprehensive dialect and accent coverage, including code-mixed audio support and intelligent proper noun preservation for speech-to-English translation.

| Language | Language Code | Language | Language Code |
| --- | --- | --- | --- |
| Hindi | hi-IN | Assamese | as-IN |
| Bengali | bn-IN | Urdu | ur-IN |
| Kannada | kn-IN | Nepali | ne-IN |
| Malayalam | ml-IN | Konkani | kok-IN |
| Marathi | mr-IN | Kashmiri | ks-IN |
| Odia | od-IN | Sindhi | sd-IN |
| Punjabi | pa-IN | Sanskrit | sa-IN |
| Tamil | ta-IN | Santali | sat-IN |
| Telugu | te-IN | Manipuri | mni-IN |
| English | en-IN | Bodo | brx-IN |
| Gujarati | gu-IN | Maithili | mai-IN |
| Dogri | doi-IN | | |
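For client-side validation before a request, the table above can be captured as a plain mapping (transcribed here for convenience; the API itself remains the source of truth):

```python
# BCP-47 codes supported by Saaras v3, transcribed from the table above.
SAARAS_V3_LANGUAGES = {
    "hi-IN": "Hindi",     "as-IN": "Assamese",
    "bn-IN": "Bengali",   "ur-IN": "Urdu",
    "kn-IN": "Kannada",   "ne-IN": "Nepali",
    "ml-IN": "Malayalam", "kok-IN": "Konkani",
    "mr-IN": "Marathi",   "ks-IN": "Kashmiri",
    "od-IN": "Odia",      "sd-IN": "Sindhi",
    "pa-IN": "Punjabi",   "sa-IN": "Sanskrit",
    "ta-IN": "Tamil",     "sat-IN": "Santali",
    "te-IN": "Telugu",    "mni-IN": "Manipuri",
    "en-IN": "English",   "brx-IN": "Bodo",
    "gu-IN": "Gujarati",  "mai-IN": "Maithili",
    "doi-IN": "Dogri",
}

def check_language_code(code: str) -> str:
    """Fail fast, before an API round trip, on an unsupported code."""
    if code != "unknown" and code not in SAARAS_V3_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code
```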

Language codes are optional. When not specified or set to unknown, the model will automatically detect the input language and return a language_probability score indicating detection confidence.
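When relying on auto-detection, downstream logic can be gated on the returned confidence. A minimal sketch; the 0.8 threshold is an illustrative choice, not an API recommendation:

```python
def detected_language(response: dict, min_probability: float = 0.8):
    """Return the detected language code, or None when detection was
    skipped (an explicit language_code was sent) or confidence is low."""
    code = response.get("language_code")
    prob = response.get("language_probability")
    if code is None or prob is None:  # no language detected / detection skipped
        return None
    return code if prob >= min_probability else None

# e.g. with the detection fields from an API response:
print(detected_language({"language_code": "hi-IN", "language_probability": 0.95}))  # prints "hi-IN"
```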

Additional Capabilities:

  • Includes dialects and accents of the above languages
  • Code-mixed audio support
  • Intelligent Proper Noun and Entity Preservation to ensure proper nouns, regional names, and entities are recognized and retained accurately during transcription

API Response Format

The Speech-to-Text API returns a JSON response with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| request_id | string | Unique identifier for the API request. |
| transcript | string | The transcribed text from the provided audio file. |
| timestamps | object or null | Word-level timestamps (start_time_seconds, end_time_seconds, words). Included only when with_timestamps is set to true. |
| diarized_transcript | object or null | Diarized transcript with speaker labels. Available through the batch API. |
| language_code | string or null | BCP-47 code of the detected language (e.g., hi-IN). Returns the most predominant language if multiple are detected; null if no language is detected. |
| language_probability | number or null | Confidence of language detection, from 0.0 to 1.0 (higher is more confident). Populated when language_code is omitted or set to unknown; null when a specific language_code is provided, since detection is skipped. The field itself is always present in the response. |

Example Response:

```json
{
  "request_id": "20260209_abc123-def4-5678-ghij-klmnopqrstuv",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "timestamps": null,
  "diarized_transcript": null,
  "language_code": "hi-IN",
  "language_probability": 0.95
}
```
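The timestamps object (returned when with_timestamps is set to true on the request) can be consumed as follows. This sketch assumes the three keys listed in the table hold parallel lists, with words[i] spanning start_time_seconds[i] to end_time_seconds[i] — a layout assumption worth confirming against a real response:

```python
def words_with_times(timestamps):
    """Pair each word with its (start, end) time in seconds.

    Assumes parallel lists under 'words', 'start_time_seconds',
    and 'end_time_seconds' (layout assumption, see note above).
    """
    if timestamps is None:  # with_timestamps was false or omitted
        return []
    return list(zip(
        timestamps["words"],
        timestamps["start_time_seconds"],
        timestamps["end_time_seconds"],
    ))

sample = {
    "words": ["नमस्ते", "आप"],
    "start_time_seconds": [0.0, 0.6],
    "end_time_seconds": [0.5, 0.9],
}
pairs = words_with_times(sample)  # [("नमस्ते", 0.0, 0.5), ("आप", 0.6, 0.9)]
```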

Usage Example

The default transcribe mode produces standard transcription in the original language with proper formatting and number normalization.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

response = client.speech_to_text.transcribe(
    file=open("audio.wav", "rb"),
    model="saaras:v3",
    mode="transcribe"  # default mode
)

print(response)
# Output: मेरा फोन नंबर है 9840950950
```

Deprecation Notice: Saaras v2.5 will be deprecated soon. We recommend migrating to Saaras v3 for improved accuracy and performance. The v2.5 model will continue to work during the transition period, but new features and improvements will only be available in v3.

About Saaras v2.5

Saaras v2.5 is the previous speech translation model available in the Speech-to-Text Translate endpoint (/speech-to-text-translate). It converts speech directly to English text with enhanced telephony support and intelligent entity preservation.

Key Difference: Saaras v2.5 uses the /speech-to-text-translate endpoint, while Saaras v3 uses the /speech-to-text endpoint with mode parameter support.

Key Features (v2.5)

Domain-Aware Translation

Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.

Superior Telephony Performance

Optimized for 8KHz telephony audio with enhanced multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning.

Multi-Language Support

Supports 11 Indian languages with optional language identification.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.

Direct Translation

Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

Translation Quality (v2.5 Benchmarks)

The COMET score, a robust metric for evaluating machine translation of speech, assesses semantic accuracy, fluency, and contextual relevance. Saaras v2.5 achieves exceptional performance on the Vistaar + IndicVoices benchmark, a dataset curated from diverse Indian-language audio sources, including code-mixed content, noisy environments, and regional accents.

COMET Score Performance:

  • Across 11 Languages: 89.3%
  • English: 94.62%
  • Hindi: 91.83%
  • 9 Other languages: 88.41%

Higher is better; evaluated on the Vistaar + IndicVoices benchmark.

Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.

Dataset Description: Contains real-world, multi-accented speech samples covering 10 major Indic languages, ensuring representation of India’s linguistic diversity, and includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.

Per-language COMET scores (from the benchmark chart):

| Language | COMET Score |
| --- | --- |
| Bengali | 88.06 |
| English | 94.62 |
| Gujarati | 90.33 |
| Hindi | 91.83 |
| Kannada | 88.30 |
| Malayalam | 89.28 |
| Marathi | 89.07 |
| Odia | 89.77 |
| Punjabi | 86.39 |
| Tamil | 86.45 |
| Telugu | 88.06 |

v2.5 Usage Example

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Using deprecated v2.5 model
response = client.speech_to_text.translate(
    file=open("audio.wav", "rb"),
    model="saaras:v2.5"  # Deprecated - migrate to saaras:v3
)

print(response)
```

Migration Guide

To migrate from Saaras v2.5 to v3:

  1. Change the endpoint: Switch from /speech-to-text-translate to /speech-to-text
  2. Update the model parameter: Change from saaras:v2.5 to saaras:v3
  3. Add the mode parameter: Use mode="translate" to get English output (similar to v2.5 behavior)
```diff
# Endpoint change
- POST /speech-to-text-translate
+ POST /speech-to-text

# Parameter changes
- model="saaras:v2.5"
+ model="saaras:v3"
+ mode="translate"
```

SDK Migration:

```diff
# Python
- response = client.speech_to_text.translate(file, model="saaras:v2.5")
+ response = client.speech_to_text.transcribe(file, model="saaras:v3", mode="translate")

# JavaScript
- const response = await client.speechToText.translate(file, { model: "saaras:v2.5" });
+ const response = await client.speechToText.transcribe(file, { model: "saaras:v3", mode: "translate" });
```

The response format remains compatible.
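During migration, a thin wrapper can keep the old "speech in, English out" call sites unchanged while routing to v3. A sketch assuming the SDK calls shown above, where `client` is a SarvamAI instance:

```python
def translate_speech(client, audio_path: str):
    """v2.5-style direct speech-to-English via the v3 endpoint.

    Drop-in replacement for:
        client.speech_to_text.translate(file, model="saaras:v2.5")
    """
    with open(audio_path, "rb") as f:
        return client.speech_to_text.transcribe(
            file=f, model="saaras:v3", mode="translate"
        )
```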