Speech-to-Text APIs | Sarvam API Docs

Sarvam AI's speech recognition model is Saaras v3, a state-of-the-art automatic speech recognition (ASR) model with five output modes: transcribe, translate, verbatim, transliterate, and codemix.

Saaras v3 (Recommended)

Best choice for new integrations.

Choosing the model, endpoint & mode

Goal	Endpoint	Model	`mode`
Transcribe in the spoken language	`/speech-to-text`	`saaras:v3`	`transcribe`
Translate speech to English	`/speech-to-text`	`saaras:v3`	`translate`
Word-for-word (with fillers)	`/speech-to-text`	`saaras:v3`	`verbatim`
Romanized (Latin-script) output	`/speech-to-text`	`saaras:v3`	`translit`
Code-mixed output	`/speech-to-text`	`saaras:v3`	`codemix`
Legacy translate endpoint	`/speech-to-text-translate`	`saaras:v2.5`	—

The `mode` parameter is supported only by `saaras:v3` on the `/speech-to-text` endpoint. The `/speech-to-text-translate` endpoint is legacy (`saaras:v2.5`); for new integrations, use `/speech-to-text` with `mode="translate"`.
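As a rough illustration of the table above, the sketch below maps a mode to the endpoint path and model it belongs with. The helper names (`stt_request_fields`, `legacy_translate_fields`) and the shape of the returned dictionary are assumptions made for this example, not part of the Sarvam SDK; the sketch only assembles values and sends no request.

```python
from typing import Dict

# Mode values copied from the table above (valid for saaras:v3 only).
SAARAS_V3_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def stt_request_fields(mode: str) -> Dict[str, str]:
    """Return the endpoint path, model, and mode for a saaras:v3 request.

    Hypothetical helper: it validates the mode and builds a dict, nothing more.
    """
    if mode not in SAARAS_V3_MODES:
        raise ValueError(
            f"unsupported mode {mode!r}; expected one of {SAARAS_V3_MODES}"
        )
    return {"endpoint": "/speech-to-text", "model": "saaras:v3", "mode": mode}


def legacy_translate_fields() -> Dict[str, str]:
    """Legacy route: saaras:v2.5 does not take a mode parameter."""
    return {"endpoint": "/speech-to-text-translate", "model": "saaras:v2.5"}
```

Rejecting unknown modes locally (for example `transliterate` instead of `translit`) catches a typo before any audio is uploaded.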

API Types

Three API types are available, depending on audio length and latency requirements:

REST API

Synchronous processing for files under 30 seconds.

Batch API

Asynchronous processing for files up to 2 hours.

Streaming API

Real-time audio streaming with instant results.

Not sure which one fits your audio length and latency needs? See Which Speech-to-Text API to Use for a side-by-side comparison of REST, WebSocket, and Batch.
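The choice between the three API types can be sketched as a small decision function. This is a minimal sketch that uses only the duration limits stated on this page; the function name `choose_api` and its return labels are hypothetical, and the linked comparison page remains the authoritative guide.

```python
REST_MAX_SECONDS = 30            # REST: files under 30 seconds
BATCH_MAX_SECONDS = 2 * 60 * 60  # Batch: files up to 2 hours


def choose_api(duration_seconds: float, realtime: bool = False) -> str:
    """Pick 'streaming', 'rest', or 'batch' for a piece of audio.

    Assumption: exactly 30 seconds is routed to Batch, because this page
    describes the REST API as handling files *under* 30 seconds.
    """
    if realtime:
        return "streaming"
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds < REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError("audio is longer than 2 hours; split it before uploading")
```

For example, a 12-second voice note would go to the REST API, while a one-hour call recording would go to the Batch API.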

Supported Audio Formats & MIME Types

The REST and Batch APIs for speech-to-text (STT) and speech-to-text-translate (STTT) support more than 10 major audio formats, along with their MIME type variants, as listed below:

Format Group	Supported MIME Types
MP3 Variants	`mpeg`, `mp3`, `mpeg3`, `x-mpeg-3`, `x-mp3`
WAV Variants	`wav`, `x-wav`, `wave`
AAC Variants	`aac`, `x-aac`
AIFF Variants	`aiff`, `x-aiff`
OGG / Opus Formats	`ogg`, `opus`
FLAC Variants (Lossless)	`flac`, `x-flac`
MP4 / M4A Audio	`mp4`, `x-m4a`
AMR (Narrowband)	`amr`
WMA (Windows Media Audio)	`x-ms-wma`
WEBM (Audio & Video)	`webm`
PCM Formats	`pcm_s16le`, `pcm_l16`, `pcm_raw`

For most audio formats, our API automatically detects the codec. However, when using PCM formats (`pcm_s16le`, `pcm_l16`, `pcm_raw`), you must explicitly specify the `input_audio_codec` parameter. PCM files are only supported at a 16 kHz sample rate.

WebSocket/Streaming APIs: The STT and STTT WebSocket streaming APIs support only WAV and raw PCM formats (`wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`). Other audio formats are not supported for real-time streaming.
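The two format rules above (PCM needs an explicit codec, and streaming accepts only WAV and raw PCM) can be captured in a few lines. This is a sketch under the assumption that the format is passed around as one of the short names used on this page; the function names are hypothetical, and only the `input_audio_codec` parameter name comes from the text.

```python
from typing import Dict

# Format names copied from this page.
PCM_CODECS = frozenset({"pcm_s16le", "pcm_l16", "pcm_raw"})
STREAMING_FORMATS = PCM_CODECS | {"wav"}


def needs_explicit_codec(fmt: str) -> bool:
    """PCM input is not auto-detected, so the codec must be named."""
    return fmt.lower() in PCM_CODECS


def streaming_supported(fmt: str) -> bool:
    """The WebSocket streaming APIs accept only WAV and raw PCM."""
    return fmt.lower() in STREAMING_FORMATS


def codec_fields(fmt: str) -> Dict[str, str]:
    """Extra request fields for a format: empty unless the input is PCM."""
    if needs_explicit_codec(fmt):
        return {"input_audio_codec": fmt.lower()}
    return {}
```

Merging `codec_fields(fmt)` into the request parameters keeps the PCM special case in one place instead of scattering it across call sites.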

Technical Capabilities

Language Support

22 Indian languages (Saaras v3)
Automatic language detection
Code-mixing support
Multi-speaker handling

Advanced Processing

Speaker diarization (Batch API)
Timestamp generation
Entity preservation
Telephony optimization

Limits

Limit	Value
Real-time REST: max audio duration	30 seconds per request
Batch API: max file duration	2 hours per file
Batch API: max files per job	20
Batch API: diarization	Up to 20 speakers (`num_speakers`)
Streaming WebSocket: formats	WAV and raw PCM only (`wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`)
Streaming WebSocket: sample rate	16000 Hz (default) or 8000 Hz
Concurrency / rate limits	Per plan — see Rate Limits
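The Batch API limits in the table above can be checked locally before a job is submitted. The sketch below assumes the caller already knows each file's duration in seconds; the function name `batch_job_problems` is hypothetical, and only `num_speakers` is a parameter name taken from this page.

```python
from typing import List, Optional, Sequence

MAX_FILES_PER_JOB = 20
MAX_FILE_SECONDS = 2 * 60 * 60  # 2 hours per file
MAX_SPEAKERS = 20


def batch_job_problems(
    durations_seconds: Sequence[float],
    num_speakers: Optional[int] = None,
) -> List[str]:
    """Return human-readable limit violations; an empty list means none found."""
    problems: List[str] = []
    if len(durations_seconds) > MAX_FILES_PER_JOB:
        problems.append(
            f"{len(durations_seconds)} files exceeds the limit of {MAX_FILES_PER_JOB} per job"
        )
    for index, seconds in enumerate(durations_seconds):
        if seconds > MAX_FILE_SECONDS:
            problems.append(f"file {index} is longer than 2 hours")
    if num_speakers is not None and not 1 <= num_speakers <= MAX_SPEAKERS:
        problems.append(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    return problems
```

Returning a list rather than raising on the first violation lets a caller report every problem with a job at once.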

Before uploading audio, run through the Preparing Your Audio checklist (sample rate, channels, format, and duration limits) to avoid the most common HTTP 400 errors.
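For WAV input, part of such a check can be automated with Python's standard `wave` module. The sketch below is not the checklist itself: it reads the sample rate, channel count, and duration from a WAV header and compares them with the limits stated on this page. The function names and the `silent_wav` demo helper are hypothetical.

```python
import io
import wave
from typing import BinaryIO, Dict, List, Union


def inspect_wav(source: Union[str, BinaryIO]) -> Dict[str, float]:
    """Read sample rate, channel count, and duration from a WAV file."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / rate,
        }


def preflight(info: Dict[str, float], streaming: bool = False) -> List[str]:
    """Compare WAV properties with the limits listed on this page."""
    problems: List[str] = []
    if streaming and info["sample_rate"] not in (16000, 8000):
        problems.append("streaming expects a 16000 Hz or 8000 Hz sample rate")
    # Assumption: exactly 30 seconds is flagged, since the page describes
    # the REST API as handling files under 30 seconds.
    if not streaming and info["duration_seconds"] >= 30:
        problems.append("too long for the REST API; consider the Batch API")
    return problems


def silent_wav(seconds: int, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demo helper only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer
```

Calling `preflight(inspect_wav("call.wav"))` on a local file (the file name here is a placeholder) returns an empty list when no listed limit is violated. Other formats such as MP3 or FLAC would need a different library to inspect.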

Next Steps

Choose Your API

Select the appropriate API type based on your use case.

Get API Key

Go Live

Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.