Speech-to-Text APIs


Sarvam AI offers two speech recognition models:
  • Saaras v3 (recommended): state-of-the-art ASR with flexible output modes (transcribe, translate, verbatim, transliterate, and codemix)
  • Saarika v2.5 (legacy): scheduled for deprecation; migrate to Saaras v3

API Types

Three API types are available:
  • REST API: synchronous processing for files under 30 seconds
  • Batch API: asynchronous processing for files up to 2 hours
  • Streaming API (WebSocket): real-time audio with results returned as the audio arrives

Not sure which one fits your audio length and latency needs? See Which Speech-to-Text API to Use for a side-by-side comparison of REST, WebSocket, and Batch.
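
As a rough illustration of the split described above, the sketch below picks an API type from a clip's duration. The function name `choose_stt_api` and its return strings are hypothetical and not part of any Sarvam SDK. Treating a clip of exactly 30 seconds as a Batch job is a conservative assumption on my part.

```python
REST_MAX_SECONDS = 30             # REST API: files under 30 seconds
BATCH_MAX_SECONDS = 2 * 60 * 60   # Batch API: files up to 2 hours


def choose_stt_api(duration_seconds: float, realtime: bool = False) -> str:
    """Return "streaming", "rest", or "batch" for a clip (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if realtime:
        # Live audio goes to the Streaming (WebSocket) API regardless of length.
        return "streaming"
    if duration_seconds < REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError("audio is longer than the 2-hour Batch limit; split it first")


if __name__ == "__main__":
    for seconds in (12, 45, 5400):
        print(seconds, choose_stt_api(seconds))
```

A helper like this is only a first filter; latency requirements may still justify streaming for short clips.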

Supported Audio Formats & MIME Types

The STT and STTT REST and Batch APIs accept more than 10 major audio formats, each with one or more MIME type variants, as listed below:

| Format Group | Supported MIME Types |
| --- | --- |
| MP3 Variants | mpeg, mp3, mpeg3, x-mpeg-3, x-mp3 |
| WAV Variants | wav, x-wav, wave |
| AAC Variants | aac, x-aac |
| AIFF Variants | aiff, x-aiff |
| OGG / Opus Formats | ogg, opus |
| FLAC Variants (Lossless) | flac, x-flac |
| MP4 / M4A Audio | mp4, x-m4a |
| AMR (Narrowband) | amr |
| WMA (Windows Media Audio) | x-ms-wma |
| WEBM (Audio & Video) | webm |
| PCM Formats | pcm_s16le, pcm_l16, pcm_raw |

For most audio formats, the API detects the codec automatically. PCM formats (pcm_s16le, pcm_l16, pcm_raw) are the exception: you must specify the input_audio_codec parameter explicitly. PCM files are supported only at a 16 kHz sample rate.
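
The PCM rule above can be sketched as a small helper that decides which extra request field, if any, is needed. The function name `codec_fields` is hypothetical; only the `input_audio_codec` parameter name and the three PCM codec values come from this page. How the field is actually attached to a request depends on the client you use.

```python
from __future__ import annotations

PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}
PCM_SAMPLE_RATE_HZ = 16000  # PCM files are supported only at 16 kHz


def codec_fields(audio_format: str, sample_rate_hz: int | None = None) -> dict[str, str]:
    """Return the extra request fields a format needs (hypothetical helper)."""
    fmt = audio_format.strip().lower()
    if fmt not in PCM_CODECS:
        # Non-PCM formats: the codec is detected automatically.
        return {}
    if sample_rate_hz != PCM_SAMPLE_RATE_HZ:
        raise ValueError("PCM input must be sampled at 16 kHz")
    return {"input_audio_codec": fmt}


if __name__ == "__main__":
    print(codec_fields("mp3"))
    print(codec_fields("pcm_s16le", sample_rate_hz=16000))
```

Raising early on a wrong sample rate keeps the failure on the client side instead of surfacing later as a rejected request.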

WebSocket/Streaming APIs: The STT and STTT WebSocket streaming APIs only support WAV and raw PCM formats (wav, pcm_s16le, pcm_l16, pcm_raw). Other audio formats are not supported for real-time streaming.
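
A minimal sketch of a client-side guard for the streaming restriction above. The name `can_stream` is hypothetical; the four accepted format values are the ones listed on this page.

```python
STREAMING_FORMATS = frozenset({"wav", "pcm_s16le", "pcm_l16", "pcm_raw"})


def can_stream(audio_format: str) -> bool:
    """True if the format is accepted by the WebSocket streaming APIs."""
    return audio_format.strip().lower() in STREAMING_FORMATS


if __name__ == "__main__":
    for fmt in ("wav", "mp3", "pcm_l16"):
        print(fmt, can_stream(fmt))
```

Audio in any other format would need to be converted to WAV or raw PCM before streaming, or sent through the REST or Batch API instead.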


Technical Capabilities

Language Support
  • 22 Indian languages (Saaras v3)
  • Automatic language detection
  • Code-mixing support
  • Multi-speaker handling
Advanced Processing
  • Speaker diarization (Batch API)
  • Timestamp generation
  • Entity preservation
  • Telephony optimization

Limits

| Limit | Value |
| --- | --- |
| Real-time REST: max audio duration | 30 seconds per request |
| Batch API: max file duration | 2 hours per file |
| Batch API: max files per job | 20 |
| Batch API: diarization | Up to 20 speakers (num_speakers) |
| Streaming WebSocket: formats | WAV and raw PCM only (wav, pcm_s16le, pcm_l16, pcm_raw) |
| Streaming WebSocket: sample rate | 16000 Hz (default) or 8000 Hz |
| Concurrency / rate limits | Per plan (see Rate Limits) |
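
The Batch limits above can be checked before a job is submitted. The sketch below is one way to do that; the `BatchFile` type and the `batch_job_problems` function are hypothetical, and the lower bound of 1 for `num_speakers` is an assumption, since this page only states the upper bound of 20.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_FILES_PER_JOB = 20
MAX_FILE_SECONDS = 2 * 60 * 60  # 2 hours per file
MAX_SPEAKERS = 20


@dataclass
class BatchFile:
    name: str
    duration_seconds: float


def batch_job_problems(files: List[BatchFile],
                       num_speakers: Optional[int] = None) -> List[str]:
    """Return human-readable limit violations; an empty list means none found."""
    problems = []
    if not files:
        problems.append("job has no files")
    if len(files) > MAX_FILES_PER_JOB:
        problems.append(
            f"{len(files)} files exceeds the limit of {MAX_FILES_PER_JOB} per job")
    for f in files:
        if f.duration_seconds > MAX_FILE_SECONDS:
            problems.append(f"{f.name} is longer than 2 hours")
    # Assumption: at least one speaker; the page documents only the maximum.
    if num_speakers is not None and not 1 <= num_speakers <= MAX_SPEAKERS:
        problems.append(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    return problems


if __name__ == "__main__":
    job = [BatchFile("call_01.wav", 1800), BatchFile("call_02.wav", 9000)]
    print(batch_job_problems(job, num_speakers=2))
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.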

Before uploading audio, run through the Preparing Your Audio checklist — sample rate, channels, format, and duration limits — to avoid the most common 400 errors.
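
For WAV files, the properties named in that checklist can be read locally with Python's standard `wave` module. This is a sketch under stated assumptions: `wav_info` is a hypothetical helper, it handles only uncompressed WAV files that the `wave` module can parse, and it does not reproduce the Preparing Your Audio checklist itself.

```python
import wave


def wav_info(path: str) -> dict:
    """Read sample rate, channel count, sample width, and duration of a WAV file."""
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "duration_seconds": frames / rate,
        }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        info = wav_info(sys.argv[1])
        print(info)
        # REST accepts files under 30 seconds; longer files belong in Batch.
        print("fits REST:", info["duration_seconds"] < 30)
```

Other formats (MP3, FLAC, and so on) need a different inspection tool, since `wave` reads only WAV containers.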

Next Steps

1. Choose Your API
   Select the appropriate API type based on your use case.

2. Get API Key
   Sign up and get your API key from the dashboard.

3. Go Live
   Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.