Which Speech-to-Text API to Use
Sarvam gives you three ways to run speech recognition on the same models: the REST, WebSocket, and Batch APIs. They differ in how you send audio, how fast you get results, the maximum audio they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.
Every transport has a Speech-to-Text-Translate counterpart that returns English instead of the source language. The transport trade-offs below are identical — only the output language changes. See Speech-to-Text-Translate.
Quick decision
- REST: a short clip (≤30s) and you want the transcript back in one call.
- WebSocket: live microphone or call audio that needs results as the user speaks.
- Batch: long recordings (up to 2 hours), with diarization and timestamps.
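The quick decision above can be written down as a small helper. This is only a sketch of the rule of thumb on this page: the function name, the `Transport` enum, and the argument names are hypothetical and not part of any Sarvam SDK. The 30-second and 2-hour limits come from this page.

```python
from enum import Enum


class Transport(Enum):
    REST = "rest"
    WEBSOCKET = "websocket"
    BATCH = "batch"


REST_MAX_SECONDS = 30            # REST accepts clips up to 30 seconds
BATCH_MAX_SECONDS = 2 * 60 * 60  # Batch accepts recordings up to 2 hours


def choose_transport(duration_s=None, live=False, needs_diarization=False):
    """Pick a transport from the rule of thumb on this page (hypothetical helper)."""
    if live:
        # Live audio has no known duration up front, so it always streams.
        return Transport.WEBSOCKET
    if duration_s is None:
        raise ValueError("duration_s is required for pre-recorded audio")
    if duration_s > BATCH_MAX_SECONDS:
        raise ValueError("recording exceeds the 2-hour Batch limit; split it first")
    if needs_diarization or duration_s > REST_MAX_SECONDS:
        # Diarization and timestamps are listed under Batch on this page.
        return Transport.BATCH
    return Transport.REST
```

For example, `choose_transport(duration_s=12)` picks REST, while the same clip with `needs_diarization=True` is routed to Batch.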
Comparison
When to use each
REST — POST /speech-to-text
- The audio is already captured and short (≤30 seconds).
- You want one request, one response — no connection to manage.
- Examples: voice search, push-to-talk commands, transcribing a short voice note.
- REST API guide →
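Before sending a clip to REST, it can help to check locally that it fits the 30-second limit. The sketch below uses only Python's standard `wave` module and assumes the clip is a WAV file. The helper names are hypothetical, and the check does not call the API.

```python
import wave

REST_MAX_SECONDS = 30.0  # limit stated on this page


def wav_duration_seconds(source):
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_rest_limit(source):
    """True if the WAV clip is short enough for the REST endpoint."""
    return wav_duration_seconds(source) <= REST_MAX_SECONDS
```

A clip that fails this check is a candidate for the Batch API instead.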
WebSocket — GET /speech-to-text/ws
- Audio arrives continuously from a mic, browser, or telephony stream.
- You need transcripts as the user speaks (live captions, barge-in for voice agents).
- Note: only WAV / raw PCM is accepted, and results are final per utterance; there are no interim is_final partials. See finalization semantics.
- Streaming API guide →
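Because the streaming endpoint takes raw PCM, a client usually slices the captured audio into small fixed-duration frames before sending them. The generator below is a minimal sketch of that slicing step only. The 16 kHz, 16-bit mono format and the 100 ms frame size are assumptions chosen for illustration, and the message format expected by the WebSocket is not covered here (see the Streaming API guide).

```python
def pcm_frames(pcm, sample_rate=16000, sample_width=2, channels=1, frame_ms=100):
    """Yield consecutive frames of raw PCM, each covering about frame_ms of audio.

    All defaults are illustrative assumptions, not values required by the API.
    """
    block = sample_width * channels                    # bytes per sample across channels
    frame_bytes = sample_rate * block * frame_ms // 1000
    frame_bytes -= frame_bytes % block                 # keep frames aligned to whole samples
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small for this audio format")
    for start in range(0, len(pcm), frame_bytes):
        # The final frame may be shorter than the others.
        yield pcm[start:start + frame_bytes]
```

Each yielded frame would then be sent over the open connection as audio arrives.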
Batch — POST /speech-to-text/job/v1
- Recordings are long (up to 2 hours) or you have many files (up to 20 per job).
- You need speaker diarization and chunk-level timestamps (e.g. subtitles, meeting minutes).
- Latency isn’t critical — you submit a job and download results when it finishes.
- Batch API guide →
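With a limit of 20 files per job, a larger collection has to be split across several jobs. The sketch below only plans the grouping. `plan_jobs` is a hypothetical helper, and submitting each group is left to the Batch API guide.

```python
MAX_FILES_PER_JOB = 20  # per-job limit stated on this page


def plan_jobs(paths, max_files=MAX_FILES_PER_JOB):
    """Split a collection of file paths into groups of at most max_files each."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    paths = list(paths)
    return [paths[i:i + max_files] for i in range(0, len(paths), max_files)]
```

For 45 recordings this yields three groups of 20, 20, and 5 files, so three jobs would be submitted.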
Need English output regardless of the spoken language? Use the same transport on Speech-to-Text-Translate (/speech-to-text-translate, /speech-to-text-translate/ws, /speech-to-text-translate/job/v1) with mode="translate" on Saaras v3.
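The endpoint paths on this page follow a regular pattern, which the sketch below captures as a lookup. The function and dictionary names are hypothetical. The paths and HTTP methods are the ones listed above, and the base URL is deliberately left out.

```python
# Path suffix and HTTP method per transport, as listed on this page.
TRANSPORTS = {
    "rest": ("POST", ""),
    "websocket": ("GET", "/ws"),
    "batch": ("POST", "/job/v1"),
}


def endpoint(transport, translate=False):
    """Return (method, path) for a transport; translate=True selects the English-output API."""
    method, suffix = TRANSPORTS[transport]
    root = "/speech-to-text-translate" if translate else "/speech-to-text"
    return method, root + suffix
```

For example, `endpoint("batch", translate=True)` returns `("POST", "/speech-to-text-translate/job/v1")`.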