Which Speech-to-Text API to Use

Sarvam gives you three ways to run speech recognition on the same models: the REST, WebSocket, and Batch APIs. They differ in how you send audio, how fast you get results, the maximum audio they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.

Every transport has a Speech-to-Text-Translate counterpart that returns English instead of the source language. The transport trade-offs below are identical — only the output language changes. See Speech-to-Text-Translate.

Quick decision

REST

You have a short clip (≤30 seconds) and want the transcript back in one call.

WebSocket

Live microphone or call audio that needs results as the user speaks.

Batch

Long recordings (up to 2 hours), with diarization and timestamps.

Comparison

	REST	WebSocket	Batch
Endpoint	`POST /speech-to-text`	`GET /speech-to-text/ws`	`POST /speech-to-text/job/v1` (job flow)
Processing	Synchronous	Real-time streaming	Asynchronous (job)
Max audio length	30 seconds	Continuous (chunked)	2 hours per file
Files per request	1	1 stream	Up to 20 per job
Results	Final transcript in the response	Final transcript per utterance (on VAD end-of-speech or `flush()`)	Final transcript, downloaded when the job completes
Latency	One round-trip after upload	Lowest — results arrive while audio is still streaming	Highest — minutes, depending on queue and duration
Speaker diarization	No	No	Yes
Timestamps	No	No	Yes (chunk-level)
Audio formats	All supported formats (auto-detected; PCM at 16 kHz)	WAV and raw PCM only (`wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`)	All supported formats
Output modes (Saaras v3)	`transcribe`, `translate`, `verbatim`, `translit`, `codemix`	`transcribe`, `translate`, `verbatim`, `translit`, `codemix`	`transcribe`, `translate`, `verbatim`, `translit`, `codemix`
Best for	Short clips, voice commands, quick tests	Voice agents, live captions, call streaming	Meetings, interviews, call-center recordings, bulk pipelines
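The comparison above can be read as a small decision rule. The following is a minimal sketch of that rule, not part of any Sarvam SDK: the function name `choose_transport` and its parameters are hypothetical, and the only facts it encodes are the limits from the table (30 seconds for REST, 2 hours per file for Batch, diarization and timestamps on Batch only).

```python
# Limits taken from the comparison table; the helper itself is illustrative.
REST_MAX_SECONDS = 30
BATCH_MAX_SECONDS = 2 * 60 * 60  # 2 hours per file


def choose_transport(duration_s=None, live=False,
                     needs_diarization=False, needs_timestamps=False):
    """Return "rest", "websocket", or "batch" for a given workload.

    Hypothetical helper: it only mirrors the table, it does not call any API.
    """
    if live:
        # Live audio has no known duration; WebSocket is the only streaming option.
        if needs_diarization or needs_timestamps:
            raise ValueError("diarization and timestamps are Batch-only features")
        return "websocket"

    if duration_s is None:
        raise ValueError("duration_s is required for recorded audio")
    if duration_s > BATCH_MAX_SECONDS:
        raise ValueError("file exceeds the 2-hour Batch limit; split it first")

    # Anything longer than 30 s, or needing Batch-only features, goes to Batch.
    if needs_diarization or needs_timestamps or duration_s > REST_MAX_SECONDS:
        return "batch"
    return "rest"
```

For example, `choose_transport(duration_s=12)` selects REST, while the same clip with `needs_diarization=True` is routed to Batch. Raising an error for live audio that needs diarization is one possible design choice; an application could equally record the stream and submit it to Batch afterwards.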

When to use each

REST — POST /speech-to-text

The audio is already captured and short (≤30 seconds).
You want one request, one response — no connection to manage.
Examples: voice search, push-to-talk commands, transcribing a short voice note.
REST API guide →
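Because REST rejects anything over 30 seconds, it can help to check a clip's length locally before uploading. The sketch below does this for WAV files using only the Python standard library; the helper names are hypothetical, and other formats would need a different duration probe. It does not send any request.

```python
import io
import wave

REST_MAX_SECONDS = 30.0  # REST limit stated on this page


def wav_duration_seconds(source):
    """Duration of a WAV file given a path string or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_rest_limit(source, limit_s=REST_MAX_SECONDS):
    """True if the clip is short enough for a single REST call."""
    return wav_duration_seconds(source) <= limit_s


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    clip = make_silent_wav(12.0)
    print(fits_rest_limit(clip))
```

A clip that fails this check is a candidate for the Batch API instead.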

WebSocket — GET /speech-to-text/ws

Audio arrives continuously from a mic, browser, or telephony stream.
You need transcripts as the user speaks (live captions, barge-in for voice agents).
Note: only WAV or raw PCM is accepted, and results are final per utterance; there are no interim `is_final` partials. See finalization semantics.
Streaming API guide →
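Streaming means sending audio in small pieces rather than as one file. As a rough illustration of the client-side chunking only, the sketch below slices raw PCM into fixed-duration chunks. The 16 kHz, 16-bit mono layout and the 100 ms chunk size are assumptions chosen for the example, and `pcm_chunks` is a hypothetical helper; the actual message format and connection handling are covered in the Streaming API guide and are not shown here.

```python
def pcm_chunks(pcm, sample_rate=16000, chunk_ms=100, sample_width=2, channels=1):
    """Yield successive fixed-duration chunks of raw PCM bytes.

    Defaults assume 16 kHz, 16-bit, mono audio (an assumption for this
    sketch). The final chunk may be shorter than the others.
    """
    frame_bytes = sample_width * channels
    chunk_bytes = (sample_rate * chunk_ms // 1000) * frame_bytes
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    one_second = b"\x00\x00" * 16000  # 1 s of 16-bit mono silence
    sizes = [len(chunk) for chunk in pcm_chunks(one_second)]
    print(len(sizes), sizes[0])
```

Each yielded chunk would then be written to the open WebSocket connection in order. Chunk boundaries always fall on whole frames, so no sample is split across two messages.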

Batch — POST /speech-to-text/job/v1

Recordings are long (up to 2 hours) or you have many files (up to 20 per job).
You need speaker diarization and chunk-level timestamps (e.g. subtitles, meeting minutes).
Latency isn’t critical — you submit a job and download results when it finishes.
Batch API guide →
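With a limit of 20 files per job and 2 hours per file, a large backlog has to be split into several jobs. The following sketch plans that split locally; `plan_batch_jobs` is a hypothetical helper that takes durations you have already measured, and it does not create or submit any job.

```python
BATCH_MAX_FILES_PER_JOB = 20        # per-job file limit stated on this page
BATCH_MAX_SECONDS = 2 * 60 * 60     # per-file duration limit: 2 hours


def plan_batch_jobs(durations, max_files=BATCH_MAX_FILES_PER_JOB,
                    max_seconds=BATCH_MAX_SECONDS):
    """Group files into jobs that respect the Batch limits.

    durations: dict mapping file name -> duration in seconds.
    Returns (jobs, too_long): jobs is a list of lists of file names,
    too_long lists files that exceed the per-file limit and need splitting.
    """
    accepted = [name for name, secs in durations.items() if secs <= max_seconds]
    too_long = [name for name, secs in durations.items() if secs > max_seconds]
    jobs = [accepted[i:i + max_files] for i in range(0, len(accepted), max_files)]
    return jobs, too_long


if __name__ == "__main__":
    files = {f"call_{i:03d}.wav": 600 for i in range(45)}
    files["all_hands.wav"] = 3 * 60 * 60  # over the 2-hour limit
    jobs, too_long = plan_batch_jobs(files)
    print([len(job) for job in jobs], too_long)
```

Files reported in `too_long` would need to be cut into shorter segments before they can be included in a job.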

Need English output regardless of the spoken language? Use the same transport on Speech-to-Text-Translate (`/speech-to-text-translate`, `/speech-to-text-translate/ws`, `/speech-to-text-translate/job/v1`) with `mode="translate"` on Saaras v3.
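Since the translate endpoints mirror the transcription ones path for path, switching between them can be a single flag in client code. A minimal sketch, assuming you build URLs yourself: `endpoint_path` and `STT_PATHS` are hypothetical names, and only the paths listed on this page are used (no host or authentication is implied).

```python
# Paths as listed on this page, keyed by transport.
STT_PATHS = {
    "rest": "/speech-to-text",
    "websocket": "/speech-to-text/ws",
    "batch": "/speech-to-text/job/v1",
}


def endpoint_path(transport, translate=False):
    """Return the API path for a transport, optionally its translate counterpart."""
    path = STT_PATHS[transport]
    if translate:
        # The translate endpoints differ only in the first path segment.
        path = path.replace("/speech-to-text", "/speech-to-text-translate", 1)
    return path


if __name__ == "__main__":
    print(endpoint_path("batch", translate=True))
```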
