Which Speech-to-Text API to Use

Sarvam gives you three ways to run speech recognition on the same models: the REST, WebSocket, and Batch APIs. They differ in how you send audio, how fast you get results, the maximum audio they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.

Every transport has a Speech-to-Text-Translate counterpart that returns English instead of the source language. The transport trade-offs below are identical — only the output language changes. See Speech-to-Text-Translate.

Quick decision
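
As a rough sketch, the choice can be written as a small helper that applies the limits from the comparison below. The function name `choose_transport` and its parameters are hypothetical and not part of any Sarvam SDK. Routing every multi-file workload to Batch is a simplifying assumption.

```python
REST_MAX_SECONDS = 30              # REST: max audio length per request
BATCH_MAX_SECONDS = 2 * 60 * 60    # Batch: max audio length per file


def choose_transport(duration_s=None, live=False, needs_diarization=False,
                     needs_timestamps=False, file_count=1):
    """Return "rest", "websocket", or "batch" for a workload (illustrative only)."""
    if live:
        # Diarization and timestamps are listed as Batch-only features.
        if needs_diarization or needs_timestamps:
            raise ValueError("diarization and timestamps are only available on Batch")
        return "websocket"
    if duration_s is None:
        raise ValueError("duration_s is required for recorded audio")
    if duration_s > BATCH_MAX_SECONDS:
        raise ValueError("file exceeds the 2-hour per-file Batch limit")
    if (needs_diarization or needs_timestamps
            or file_count > 1 or duration_s > REST_MAX_SECONDS):
        return "batch"
    return "rest"
```

For example, `choose_transport(duration_s=12)` selects REST, while the same clip with `needs_diarization=True` is routed to Batch.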

Comparison

| | REST | WebSocket | Batch |
|---|---|---|---|
| Endpoint | POST /speech-to-text | GET /speech-to-text/ws | POST /speech-to-text/job/v1 (job flow) |
| Processing | Synchronous | Real-time streaming | Asynchronous (job) |
| Max audio length | 30 seconds | Continuous (chunked) | 2 hours per file |
| Files per request | 1 | 1 stream | Up to 20 per job |
| Results | Final transcript in the response | Final transcript per utterance (on VAD end-of-speech or flush()) | Final transcript, downloaded when the job completes |
| Latency | One round-trip after upload | Lowest: results arrive while audio is still streaming | Highest: minutes, depending on queue and duration |
| Speaker diarization | No | No | Yes |
| Timestamps | No | No | Yes (chunk-level) |
| Audio formats | All supported formats (auto-detected; PCM at 16 kHz) | WAV and raw PCM only (wav, pcm_s16le, pcm_l16, pcm_raw) | All supported formats |
| Output modes (Saaras v3) | transcribe, translate, verbatim, translit, codemix | transcribe, translate, verbatim, translit, codemix | transcribe, translate, verbatim, translit, codemix |
| Best for | Short clips, voice commands, quick tests | Voice agents, live captions, call streaming | Meetings, interviews, call-center recordings, bulk pipelines |

When to use each

REST — POST /speech-to-text

  • The audio is already captured and short (≤30 seconds).
  • You want one request, one response — no connection to manage.
  • Examples: voice search, push-to-talk commands, transcribing a short voice note.
  • REST API guide →
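
Because REST accepts at most 30 seconds of audio, it can help to check a clip's duration locally before uploading. The sketch below does this for uncompressed PCM WAV data with Python's standard `wave` module. The helper names are hypothetical, and `make_silent_wav` exists only to produce test input. Other formats would need a different duration probe.

```python
import io
import wave

REST_MAX_SECONDS = 30.0  # REST limit stated on this page


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of an uncompressed PCM WAV held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_rest_limit(wav_bytes: bytes) -> bool:
    """True if the clip is short enough for a single REST request."""
    return wav_duration_seconds(wav_bytes) <= REST_MAX_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence (test input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()
```

A clip that fails `fits_rest_limit` is a candidate for the Batch API rather than REST.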

WebSocket — GET /speech-to-text/ws

  • Audio arrives continuously from a mic, browser, or telephony stream.
  • You need transcripts as the user speaks (live captions, barge-in for voice agents).
  • Note: only WAV / raw PCM is accepted, and results are final per utterance — there are no interim is_final partials. See finalization semantics.
  • Streaming API guide →
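
Since the WebSocket transport takes only WAV or raw PCM, a client usually validates the codec and slices captured audio into small frames before sending. The sketch below shows that preparation step only. The 100 ms chunk size and the 16 kHz mono 16-bit layout are assumptions, and the message format used to send each chunk is defined in the Streaming API guide, not here.

```python
from typing import Iterator

# Codec names listed for the WebSocket transport on this page.
STREAMING_CODECS = {"wav", "pcm_s16le", "pcm_l16", "pcm_raw"}


def check_streaming_codec(codec: str) -> str:
    """Reject codecs the streaming transport does not list as accepted."""
    if codec not in STREAMING_CODECS:
        raise ValueError(f"{codec!r} is not accepted for streaming")
    return codec


def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
               sample_width: int = 2) -> Iterator[bytes]:
    """Yield consecutive chunks of raw mono PCM, each about chunk_ms long."""
    # Compute whole samples first so every chunk stays sample-aligned.
    chunk_bytes = (sample_rate * chunk_ms // 1000) * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With these defaults each full chunk is 3,200 bytes (1,600 samples of 2 bytes each). The last chunk may be shorter.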

Batch — POST /speech-to-text/job/v1

  • Recordings are long (up to 2 hours) or you have many files (up to 20 per job).
  • You need speaker diarization and chunk-level timestamps (e.g. subtitles, meeting minutes).
  • Latency isn’t critical — you submit a job and download results when it finishes.
  • Batch API guide →
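
With a limit of 20 files per job, a larger backlog has to be split across several jobs. A minimal sketch of that grouping follows. `plan_batch_jobs` is a hypothetical helper, and submitting each group is left to the Batch API guide.

```python
from typing import Iterable, List

BATCH_MAX_FILES = 20  # per-job file limit stated on this page


def plan_batch_jobs(files: Iterable[str],
                    max_files_per_job: int = BATCH_MAX_FILES) -> List[List[str]]:
    """Group file paths into lists that each fit within one Batch job."""
    if max_files_per_job < 1:
        raise ValueError("max_files_per_job must be at least 1")
    items = list(files)
    return [items[i:i + max_files_per_job]
            for i in range(0, len(items), max_files_per_job)]
```

For instance, 45 recordings would be planned as three jobs of 20, 20, and 5 files.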

Need English output regardless of the spoken language? Use the same transport on Speech-to-Text-Translate (/speech-to-text-translate, /speech-to-text-translate/ws, /speech-to-text-translate/job/v1) with mode="translate" on Saaras v3.
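
The path pairs above follow a regular pattern, so a client can derive the endpoint from the transport and a translate flag. The sketch below encodes only the paths named on this page. `endpoint_for` is a hypothetical helper, and the base URL and authentication are deliberately left out.

```python
# Path suffix for each transport, as listed on this page.
_SUFFIX = {"rest": "", "websocket": "/ws", "batch": "/job/v1"}


def endpoint_for(transport: str, translate: bool = False) -> str:
    """Return the API path for a transport, optionally the translate variant."""
    if transport not in _SUFFIX:
        raise ValueError(f"unknown transport: {transport!r}")
    base = "/speech-to-text-translate" if translate else "/speech-to-text"
    return base + _SUFFIX[transport]
```

Keeping the mapping in one place means switching between transcription and English output changes a single flag, not three hard-coded paths.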