Saaras v3 is our state-of-the-art speech recognition model with flexible output formats. It supports multiple output modes including transcription, translation, verbatim, transliteration, and code-mixed outputs. Saaras is built to make Indic languages LLM-comprehensible, offering accurate transcriptions and translations across 23 languages (22 Indian languages + English).
Saaras v3 is the latest version, with improved accuracy and performance. It is available through the Speech-to-Text endpoint (/speech-to-text).
The mode parameter selects the output mode; each mode produces a different output format for the same input audio.
Example audio: “मेरा फोन नंबर है 9840950950”
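As a rough illustration of how the mode parameter might be passed, the sketch below assembles the form fields for a /speech-to-text request without sending anything. The helper name build_stt_fields and the use of "model" and "mode" as form keys are assumptions made for illustration; only the endpoint path, the saaras:v3 identifier, and mode="translate" are taken from this page.

```python
from typing import Dict, Optional

STT_PATH = "/speech-to-text"  # endpoint path named on this page


def build_stt_fields(mode: Optional[str] = None) -> Dict[str, str]:
    """Assemble form fields for a Saaras v3 request (hypothetical helper).

    The keys "model" and "mode" are assumed form-field names.
    """
    fields = {"model": "saaras:v3"}
    if mode is not None:
        # Only send mode when the caller asks for one; this page states that
        # standard transcription is the default when no mode is chosen.
        fields["mode"] = mode
    return fields


default_fields = build_stt_fields()
translate_fields = build_stt_fields(mode="translate")
print(STT_PATH, default_fields)
print(STT_PATH, translate_fields)
```

Leaving mode out keeps the request on the default transcription behavior, so existing callers would only need to add the field when they want a different output.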
Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
Supports 23 languages (22 Indian + English) with optional language identification.
Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.
Converts speech directly to English text, eliminating the need for separate transcription and translation steps.
Saaras v3 supports 23 languages (22 Indian languages + English) with comprehensive dialect and accent coverage, including code-mixed audio support and intelligent proper noun preservation for speech-to-English translation.
Language codes are optional. When not specified or set to unknown, the model will automatically detect the input language and return a language_probability score indicating detection confidence.
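One way a client might use the language_probability score is to accept the detected language only above a confidence threshold. The sketch below assumes the detected language is returned under a language_code key and uses a made-up sample response; the function name, the threshold value, and the sample values are all hypothetical.

```python
from typing import Any, Dict, Optional


def confident_language(response: Dict[str, Any], threshold: float = 0.8) -> Optional[str]:
    """Return the detected language only if detection confidence is high enough.

    "language_probability" is the field named on this page; "language_code"
    as the key holding the detected language is an assumption.
    """
    code = response.get("language_code")
    probability = response.get("language_probability")
    if code is None or probability is None:
        return None
    return code if probability >= threshold else None


# Made-up responses for illustration only, not real API output.
high = {"language_code": "hi-IN", "language_probability": 0.97}
low = {"language_code": "hi-IN", "language_probability": 0.41}

print(confident_language(high))
print(confident_language(low))
```

A caller could fall back to asking the user for the language, or to a configured default, whenever the function returns None.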
Additional Capabilities:
The Speech-to-Text API returns a JSON response with the following fields:
Example Response:
Standard transcription in the original language with proper formatting and number normalization. This is the default mode.
Learn how to integrate Saaras v3 into your application.
Complete API documentation for Speech-to-Text endpoint.
Step-by-step tutorial for speech-to-text transcription.
Deprecation Notice: Saaras v2.5 will be deprecated soon. We recommend migrating to Saaras v3 for improved accuracy and performance. The v2.5 model will continue to work during the transition period, but new features and improvements will only be available in v3.
Saaras v2.5 is the previous speech translation model available in the Speech-to-Text Translate endpoint (/speech-to-text-translate). It converts speech directly to English text with enhanced telephony support and intelligent entity preservation.
Key Difference: Saaras v2.5 uses the /speech-to-text-translate endpoint, while Saaras v3 uses the /speech-to-text endpoint with mode parameter support.
Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
Supports 11 Indian languages with optional language identification.
Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API.
Converts speech directly to English text, eliminating the need for separate transcription and translation steps.
COMET score, a robust metric for evaluating machine speech translation, assesses semantic accuracy, fluency, and contextual relevance. Saaras v2.5 achieves exceptional performance on the Vistaar + IndicVoices benchmark, a dataset curated from diverse Indian language audio sources, including code-mixed content, noisy environments, and regional accents.
COMET Score Performance:
Higher is better; compared on the Vistaar + IndicVoices benchmark.
Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.
Dataset Description: Contains real-world, multi-accented speech samples that cover 10 major Indic languages, ensuring representation of India’s linguistic diversity. Includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.
To migrate from Saaras v2.5 to v3:
1. Change the endpoint from /speech-to-text-translate to /speech-to-text.
2. Change the model from saaras:v2.5 to saaras:v3.
3. Set mode="translate" to get English output (similar to v2.5 behavior).
SDK Migration:
The response format remains compatible.
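The migration steps can be sketched as a small transformation over a request description. The dictionary layout ("path" and "fields" keys) and the function name migrate_request are hypothetical and not part of any SDK; the endpoint paths, model identifiers, and mode value come from the steps listed in this section.

```python
from typing import Any, Dict


def migrate_request(old: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a v2.5-style request description for Saaras v3 (hypothetical layout)."""
    new = dict(old)
    # Step 1: switch the endpoint.
    if new.get("path") == "/speech-to-text-translate":
        new["path"] = "/speech-to-text"
    fields = dict(new.get("fields", {}))
    # Step 2: switch the model identifier.
    if fields.get("model") == "saaras:v2.5":
        fields["model"] = "saaras:v3"
    # Step 3: request English output, matching v2.5 behavior,
    # unless the caller has already chosen a mode.
    fields.setdefault("mode", "translate")
    new["fields"] = fields
    return new


v25_request = {
    "path": "/speech-to-text-translate",
    "fields": {"model": "saaras:v2.5"},
}
v3_request = migrate_request(v25_request)
print(v3_request)
```

The function copies its input rather than mutating it, so the original v2.5 description stays available for comparison during the transition period.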