Saarika | Sarvam API Docs

Saarika is a high-accuracy, real-time speech recognition service optimized for a wide variety of audio inputs. It automatically detects the input language, transcribes the speech, and outputs the transcript in the original language.

Saarika-v2.5 is our flagship speech recognition model, specifically designed for Indian languages and accents. It excels in handling complex multi-speaker conversations, telephony audio, and code-mixed speech with superior accuracy.

Key Features

Superior Telephony Performance

Optimized for 8 kHz telephony audio with enhanced noise handling and superior multi-speaker recognition capabilities.

Intelligent Entity Preservation

Preserves proper nouns and entities accurately across languages, maintaining context and meaning in transcriptions.

Automatic Language Detection

Optional automatic language identification (LID), with the detected language returned in the output. Pass "unknown" as the language code when the spoken language is not known in advance.

Speaker Diarization

Provides diarized outputs with precise timestamps for multi-speaker conversations through batch API processing.

Automatic Code Mixing

Intelligently handles mid-sentence language switches in code-mixed speech, perfect for India’s multilingual conversations.

Multi-Language Support

Comprehensive support for Indian languages with high accuracy in mixed-language environments.
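Since the telephony feature above is tied to 8 kHz audio, it can help to check a recording's sample rate before sending it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `is_telephony_rate` are illustrative and not part of the Sarvam SDK, and the sketch assumes an uncompressed PCM WAV file.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return sample_rate, channels, frames / sample_rate


def is_telephony_rate(sample_rate_hz):
    """True when the audio uses the 8 kHz rate typical of phone recordings."""
    return sample_rate_hz == 8000
```

Calling `describe_wav("audio.wav")` on a call recording shows whether it is narrowband phone audio or a higher-rate capture, which is useful context when comparing transcription quality across sources.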

Language Support

Saarika supports 11 languages: English, Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, and Odia.

Languages (Code):

Hindi (hi-IN), Bengali (bn-IN), Tamil (ta-IN), Telugu (te-IN), Gujarati (gu-IN), Kannada (kn-IN), Malayalam (ml-IN), Marathi (mr-IN), Punjabi (pa-IN), Odia (od-IN), English (en-IN)

Key Language Features

English includes all dialects and accents
Includes dialects and accents of the above languages
Code-mixed audio support - Handles mixed-language content seamlessly
Intelligent Proper Noun and Entity Preservation - Ensures proper nouns, regional names, and entities are recognized and retained accurately during transcription

Supports native script output in all of the above languages.

For automatic language detection, use language_code="unknown". The model will automatically identify the spoken language and return it in the response.
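The auto-detect option described above can be sketched as a small wrapper. This is an illustration under stated assumptions rather than official SDK usage: `choose_language_code` and `transcribe_auto` are hypothetical helpers, `client` is expected to be a `sarvamai.SarvamAI` instance, and the `language_code` attribute on the response is assumed from the statement that the detected language is returned in the response.

```python
# Language codes listed in this section.
SUPPORTED_CODES = {
    "hi-IN", "bn-IN", "ta-IN", "te-IN", "gu-IN", "kn-IN",
    "ml-IN", "mr-IN", "pa-IN", "od-IN", "en-IN",
}
AUTO_DETECT = "unknown"  # sentinel that requests automatic detection


def choose_language_code(code=None):
    """Return a supported code, or "unknown" to request auto-detection."""
    if code in SUPPORTED_CODES:
        return code
    return AUTO_DETECT


def transcribe_auto(client, audio_path, code=None):
    """Transcribe a file, falling back to auto-detection when no valid code is given.

    `client` is expected to be a sarvamai.SarvamAI instance.
    """
    with open(audio_path, "rb") as audio_file:
        response = client.speech_to_text.transcribe(
            file=audio_file,
            model="saarika:v2.5",
            language_code=choose_language_code(code),
        )
    # Assumption: the detected language is exposed as `language_code`.
    detected = getattr(response, "language_code", None)
    return response, detected
```

Falling back silently to "unknown" keeps the call working when a caller passes a code outside the list, at the cost of hiding typos; raising an error instead is an equally reasonable design.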

Performance Benchmarks

Saarika delivers exceptional accuracy across all supported languages, as measured on the VISTAAR Benchmark.

CER (Character Error Rate) Scores

Lower is better - Compared on VISTAAR Benchmark

Across 11 Languages: 4.96%
English: 4.45%
Hindi: 4.42%
9 Other languages: 5.07%

WER (Word Error Rate) Scores

Lower is better - Compared on VISTAAR Benchmark

Across 11 Languages: 18.32%
English: 8.26%
Hindi: 11.81%
9 Other languages: 20.15%

Detailed CER Performance by Language

CER (Character Error Rate) measures the percentage of characters that are wrong in a transcription. Lower scores are better, with 0% being perfect.
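To make the two metrics concrete, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the model output, divided by the reference length. This is the standard textbook definition and not necessarily the exact scoring script used for the VISTAAR numbers above, which may apply text normalization first. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent; `reference` must be non-empty."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent; `reference` must contain at least one word."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For the reference "mera naam" and the output "mera nam", one character is missing out of nine, so CER is about 11.1%, while one of two words is wrong, so WER is 50%. This is why WER figures are typically much higher than CER figures for the same transcripts.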

Key Capabilities

Basic Usage

Basic transcription with a specified language code. Best suited to single-language content with clear audio.

Python

from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_API_SUBSCRIPTION_KEY"
)

# Open the audio in binary mode; the context manager closes the file after upload.
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saarika:v2.5",
        language_code="hi-IN"
    )

print(response)

Next Steps

Developer quickstart

Learn how to integrate speech to text into your application.

API Reference

Complete API documentation for speech to text endpoints.