Saaras

Saaras is a high-accuracy, real-time speech recognition service optimized for a wide variety of audio inputs. It automatically detects the input language, transcribes the speech, and translates the transcript to English. Saaras is built to make Indic languages LLM-comprehensible, offering accurate English-translated transcriptions across 10 major Indian languages.

Saaras-v2.5 is our flagship domain-aware speech recognition model, designed for production environments requiring high accuracy and robust performance. It specializes in speech-to-text translation, converting spoken content directly into English text while preserving context and meaning.

Key Features

Domain-Aware Translation

Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.

Superior Telephony Performance

Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning.

Automatic Language Detection

Built-in Language Identification (LID) with confidence scores for automatic language detection.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through the batch API.

Direct Translation

Converts speech directly to English text, eliminating the need for separate transcription and translation steps.
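
Because the model is tuned for 8 kHz telephony audio, it can be useful to check a recording's sample rate and channel count before uploading it. The following is a minimal sketch using only Python's standard `wave` module. The helper name `describe_wav` is hypothetical and not part of the Sarvam SDK, and it only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }


# Usage:
# info = describe_wav("audio.wav")
# print(info["sample_rate_hz"], info["channels"], info["duration_s"])
```

A telephony recording would typically report a sample rate of 8000 here.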

Language Support

Saaras supports 11 languages: English, Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, and Odia.

Languages (Code):

Hindi (hi-IN), Bengali (bn-IN), Tamil (ta-IN), Telugu (te-IN), Gujarati (gu-IN), Kannada (kn-IN), Malayalam (ml-IN), Marathi (mr-IN), Punjabi (pa-IN), Odia (od-IN), English (en-IN)

Additional Capabilities:

  • Includes dialects and accents of the above languages
  • Code-mixed audio support
  • Proper noun and entity preservation: proper nouns, regional names, and entities are recognized and retained accurately during transcription

All of the above are supported for speech-to-English translation.
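
The language codes listed above can be kept in a small lookup table, for example to turn a code into a display name in application code. This is only a sketch: the dictionary mirrors the list in this section, while `SAARAS_LANGUAGES` and `language_name` are hypothetical names, and it assumes that any language code your application handles uses the same format.

```python
# Language codes exactly as listed in the Language Support section.
SAARAS_LANGUAGES = {
    "hi-IN": "Hindi",
    "bn-IN": "Bengali",
    "ta-IN": "Tamil",
    "te-IN": "Telugu",
    "gu-IN": "Gujarati",
    "kn-IN": "Kannada",
    "ml-IN": "Malayalam",
    "mr-IN": "Marathi",
    "pa-IN": "Punjabi",
    "od-IN": "Odia",
    "en-IN": "English",
}


def language_name(code):
    """Return the language name for a listed code, or None if it is not listed."""
    return SAARAS_LANGUAGES.get(code)


print(language_name("ta-IN"))  # prints "Tamil"
```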

Translation Quality

COMET is a metric for evaluating machine translation of speech; it assesses semantic accuracy, fluency, and contextual relevance. Saaras achieves high COMET scores on the VISTAAR + IndicVoices benchmark, a dataset curated from diverse Indian language audio sources, including code-mixed content, noisy environments, and regional accents.

COMET Score Performance:

  • Across 11 languages: 89.3%
  • English: 94.62%
  • Hindi: 91.83%
  • 9 other languages: 88.41%

*Higher is better; compared on the VISTAAR + IndicVoices benchmark.

Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.

Dataset Description: Contains real-world, multi-accented speech samples that cover 10 major Indic languages, ensuring representation of India’s linguistic diversity. Includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.

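Figure: COMET score by language (y-axis: COMET Score, 80 to 100; x-axis: Languages)

  • Bengali: 88.06
  • English: 94.62
  • Global: 88.79
  • Gujarati: 90.33
  • Hindi: 91.83
  • Kannada: 88.3
  • Malayalam: 89.28
  • Marathi: 89.07
  • Odia: 89.77
  • Punjabi: 86.39
  • Tamil: 86.45
  • Telugu: 88.06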
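
The aggregate figures under "COMET Score Performance" appear to be unweighted means of the per-language values in the chart. The short check below reproduces them. Treating them as simple averages is an inference from the numbers, not something the page states. The chart's separate "Global" value (88.79) is left out, since how it is computed is not described.

```python
from statistics import mean

# Per-language COMET scores as shown in the chart ("Global" excluded).
scores = {
    "Bengali": 88.06, "English": 94.62, "Gujarati": 90.33, "Hindi": 91.83,
    "Kannada": 88.3, "Malayalam": 89.28, "Marathi": 89.07, "Odia": 89.77,
    "Punjabi": 86.39, "Tamil": 86.45, "Telugu": 88.06,
}

# Unweighted mean over all 11 languages.
mean_all = mean(list(scores.values()))

# Unweighted mean over the 9 languages other than English and Hindi.
mean_others = mean(
    [value for lang, value in scores.items() if lang not in ("English", "Hindi")]
)

print(f"{mean_all:.1f}")     # prints 89.3
print(f"{mean_others:.2f}")  # prints 88.41
```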

Saaras automatically detects the source language and translates the speech to English, so there is no need to specify the source language in the request.

Key Capabilities

Basic speech-to-text translation with automatic language detection. Perfect for converting Indian language speech directly to English text.

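from sarvamai import SarvamAI

# Create a client with your API subscription key
client = SarvamAI(
    api_subscription_key="YOUR_API_SUBSCRIPTION_KEY"
)

# Open the audio in binary mode; the context manager closes the file afterwards
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.translate(
        file=audio_file,
        model="saaras:v2.5"
    )

print(response)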
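
The domain-aware prompting described under Key Features can be sketched as a thin wrapper around the same call. This assumes the `translate` method accepts an optional `prompt` keyword for domain context, which is not shown in the example above and should be checked against the API reference. The `translate_audio` helper and the `SARVAM_API_KEY` environment variable name are hypothetical.

```python
def translate_audio(client, path, prompt=None):
    """Translate an audio file to English text with saaras:v2.5."""
    kwargs = {"model": "saaras:v2.5"}
    if prompt:
        # Assumption: the endpoint accepts a free-text domain prompt.
        kwargs["prompt"] = prompt
    # Open in binary mode; the file is closed once the call returns.
    with open(path, "rb") as audio_file:
        return client.speech_to_text.translate(file=audio_file, **kwargs)


# Usage (requires the sarvamai package and a valid key):
# import os
# from sarvamai import SarvamAI
# client = SarvamAI(api_subscription_key=os.environ["SARVAM_API_KEY"])
# print(translate_audio(client, "audio.wav", prompt="Banking support call"))
```

Passing the client in as an argument keeps the helper independent of how the key is loaded and makes it easy to exercise with a stub client.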

Next Steps