Saaras
Saaras v3 is our state-of-the-art speech recognition model. It supports multiple output modes, including transcription, translation, verbatim, transliteration, and code-mixed output. Saaras is built to make Indic languages LLM-comprehensible, offering accurate transcriptions and translations across 23 languages (22 Indian languages + English).
Saaras v3 is the latest version, with improved accuracy and performance. It is available through the Speech-to-Text endpoint (/speech-to-text), and the output mode is selected via the mode parameter.
Output Modes
Each mode produces a different output for the same input audio.
Example audio: “मेरा फोन नंबर है 9840950950” (“My phone number is 9840950950”)
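As a rough sketch of how the mode parameter might be supplied, the helper below assembles the form fields for a /speech-to-text request without sending anything. Only the endpoint path, the model name saaras:v3, and the mode value "translate" appear on this page. The other mode strings are guessed from the mode names used here, and the function name build_stt_fields and the returned structure are hypothetical.

```python
# Mode strings guessed from the mode names on this page. Only "translate" is
# confirmed by the text; the other spellings are assumptions.
ASSUMED_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def build_stt_fields(mode="transcribe", model="saaras:v3"):
    """Return (endpoint path, form fields) for a speech-to-text request.

    Hypothetical helper: it only assembles values and performs no network call.
    """
    if mode not in ASSUMED_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return "/speech-to-text", {"model": model, "mode": mode}


# Same audio, two different requested outputs.
print(build_stt_fields())                  # default mode
print(build_stt_fields(mode="translate"))  # English output
```

The audio file and authentication details would be attached by whichever HTTP client or SDK is used; they are left out here because this page does not describe them.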
Key Features
- Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
- Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
- Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
- Supports 23 languages (22 Indian + English) with optional language identification.
- Provides diarized outputs with precise timestamps for multi-speaker conversations through the batch API.
- Converts speech directly to English text, eliminating the need for separate transcription and translation steps.
Language Support
Saaras v3 supports 23 languages (22 Indian languages + English) with comprehensive dialect and accent coverage, including code-mixed audio support and intelligent proper noun preservation for speech-to-English translation.
Language codes are optional. When not specified or set to unknown, the model will automatically detect the input language and return a language_probability score indicating detection confidence.
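One way a client might use the detection confidence is sketched below: accept the detected language only when language_probability clears a threshold. The language_probability field comes from the description above. The language_code field name, the sample value "hi-IN", the 0.8 threshold, and the function name resolve_language are assumptions for illustration, not documented behavior.

```python
def resolve_language(response, min_probability=0.8):
    """Return the detected language code if confidence is high enough, else None.

    `response` is a parsed JSON response as a dict. The "language_code" key
    is an assumed field name; "language_probability" is named on this page.
    """
    code = response.get("language_code")
    probability = response.get("language_probability")
    if code is None or probability is None:
        return None
    return code if probability >= min_probability else None


# Made-up sample responses, for illustration only.
confident = {"language_code": "hi-IN", "language_probability": 0.97}
uncertain = {"language_code": "hi-IN", "language_probability": 0.41}

print(resolve_language(confident))  # hi-IN
print(resolve_language(uncertain))  # None
```

A caller that gets None back could ask the user for the language or retry with an explicit language code.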
Additional Capabilities:
- Covers dialects and accents of the supported languages
- Code-mixed audio support
- Proper noun and entity preservation: proper nouns, regional names, and entities are recognized and retained accurately during transcription
API Response Format
The Speech-to-Text API returns a JSON response with the following fields:
Example Response:
Key Capabilities
Saaras v3 offers five modes: Transcribe, Translate, Verbatim, Translit, and Codemix.
Transcribe Mode: Standard transcription in the original language with proper formatting and number normalization. This is the default mode.
Next Steps
Learn how to integrate Saaras v3 into your application.
Complete API documentation for Speech-to-Text endpoint.
Step-by-step tutorial for speech-to-text transcription.
Saaras v2.5 (To Be Deprecated)
Deprecation Notice: Saaras v2.5 will be deprecated soon. We recommend migrating to Saaras v3 for improved accuracy and performance. The v2.5 model will continue to work during the transition period, but new features and improvements will only be available in v3.
About Saaras v2.5
Saaras v2.5 is the previous speech translation model available in the Speech-to-Text Translate endpoint (/speech-to-text-translate). It converts speech directly to English text with enhanced telephony support and intelligent entity preservation.
Key Difference: Saaras v2.5 uses the /speech-to-text-translate endpoint, while Saaras v3 uses the /speech-to-text endpoint with mode parameter support.
Key Features (v2.5)
- Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
- Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
- Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
- Supports 11 Indian languages with optional language identification.
- Provides diarized outputs with precise timestamps for multi-speaker conversations through the batch API.
- Converts speech directly to English text, eliminating the need for separate transcription and translation steps.
Translation Quality (v2.5 Benchmarks)
The COMET score is a metric for evaluating machine translation quality. It assesses semantic accuracy, fluency, and contextual relevance. Saaras v2.5 was evaluated on the Vistaar + IndicVoices benchmark, a dataset curated from diverse Indian language audio sources, including code-mixed content, noisy environments, and regional accents.
COMET Score Performance:
- Across 11 Languages: 89.3%
- English: 94.62%
- Hindi: 91.83%
- 9 Other languages: 88.41%
Higher is better. Scores are measured on the Vistaar + IndicVoices benchmark.
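The figures above are mutually consistent if the 11-language number is read as an unweighted mean over languages. That reading is an assumption, since the page does not say how the aggregate was computed. A quick check:

```python
# Per-language COMET scores quoted above (percent).
named_scores = {"English": 94.62, "Hindi": 91.83}
other_mean = 88.41   # mean over the remaining languages
other_count = 9

# Assumption: the overall figure is an unweighted mean across all 11 languages.
total = sum(named_scores.values()) + other_mean * other_count
overall = total / (len(named_scores) + other_count)

print(round(overall, 1))  # 89.3
```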
Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.
Dataset Description: Contains real-world, multi-accented speech samples that cover 10 major Indic languages, ensuring representation of India’s linguistic diversity. Includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.
v2.5 Usage Example
Python
JavaScript
cURL
Migration Guide
To migrate from Saaras v2.5 to v3:
- Change the endpoint: Switch from /speech-to-text-translate to /speech-to-text
- Update the model parameter: Change from saaras:v2.5 to saaras:v3
- Add the mode parameter: Use mode="translate" to get English output (similar to v2.5 behavior)
SDK Migration:
The response format remains compatible.
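The three migration steps can be sketched as a small transformation on a request description. The endpoint paths, model names, and mode="translate" come from this guide. The dictionary layout with "path" and "fields" keys and the function name migrate_request are hypothetical and not part of any SDK.

```python
def migrate_request(v25_request):
    """Map a v2.5 request description onto its v3 equivalent.

    Hypothetical structure: {"path": str, "fields": dict}. The input is left
    unmodified; a new description is returned.
    """
    fields = dict(v25_request["fields"])  # copy so the caller's dict is untouched
    fields["model"] = "saaras:v3"         # step 2: update the model parameter
    fields["mode"] = "translate"          # step 3: request English output
    return {"path": "/speech-to-text", "fields": fields}  # step 1: new endpoint


old = {"path": "/speech-to-text-translate", "fields": {"model": "saaras:v2.5"}}
new = migrate_request(old)
print(new["path"])    # /speech-to-text
print(new["fields"])  # {'model': 'saaras:v3', 'mode': 'translate'}
```

Any other fields already present in the v2.5 request are carried over unchanged in this sketch; whether each of them is accepted by v3 should be checked against the API reference.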