Which Speech-to-Text API to Use
Sarvam gives you three ways to run speech recognition on the same models: the REST, WebSocket, and Batch APIs. They differ in how you send audio, how quickly you get results, the maximum audio length they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.
Every transport has a Speech-to-Text-Translate counterpart that returns English instead of the source language. The transport trade-offs below are identical — only the output language changes. See Speech-to-Text-Translate.
REST: a short clip (≤30s) and you want the transcript back in one call.
WebSocket: live microphone or call audio that needs results as the user speaks.
Batch: long recordings (up to 1 hour), with diarization and timestamps.
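The three cases above can be sketched as a small decision helper. This is only an illustration: `Transport` and `choose_transport` are hypothetical names, not part of any Sarvam SDK, and the thresholds are the "≤30s" and "up to 1 hour" figures quoted on this page. Feature requirements such as diarization are not modeled here.

```python
from enum import Enum

# Limits as quoted on this page; check the API reference before relying on them.
REST_MAX_SECONDS = 30          # "≤30s"
BATCH_MAX_SECONDS = 60 * 60    # "up to 1 hour"


class Transport(Enum):
    """Hypothetical labels for the three transports described on this page."""
    REST = "rest"
    WEBSOCKET = "websocket"
    BATCH = "batch"


def choose_transport(duration_s: float, live: bool = False) -> Transport:
    """Pick a transport from the clip length and whether the audio is live."""
    if live:
        # Live microphone or call audio: results are needed as the user speaks.
        return Transport.WEBSOCKET
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s <= REST_MAX_SECONDS:
        # Short clip: one request, one transcript.
        return Transport.REST
    if duration_s <= BATCH_MAX_SECONDS:
        # Long recording: submit it as a batch job.
        return Transport.BATCH
    raise ValueError("audio exceeds the 1-hour Batch limit stated on this page")
```

For example, `choose_transport(12)` selects REST, `choose_transport(600)` selects Batch, and `choose_transport(0, live=True)` selects WebSocket.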
REST — POST /speech-to-text
WebSocket — GET /speech-to-text/ws
is_final partials. See finalization semantics.
Batch — POST /speech-to-text/job/v1
Need English output regardless of the spoken language? Use the same transport on Speech-to-Text-Translate (/speech-to-text-translate, /speech-to-text-translate/ws, /speech-to-text-translate/job/v1) with mode="translate" on Saaras v3.
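As a rough sketch of how the transcription and translation paths line up, the lookup below maps a transport name to its HTTP method and path. `ENDPOINTS` and `endpoint_for` are hypothetical names used only for illustration. The paths are the ones listed on this page. The page gives no methods for the translate endpoints, so the code assumes they match their transcription counterparts. The base URL, authentication, and request body (including `mode="translate"`) are deliberately left out.

```python
# Method and path for each transport, as listed on this page.
ENDPOINTS = {
    "rest": ("POST", "/speech-to-text"),
    "websocket": ("GET", "/speech-to-text/ws"),
    "batch": ("POST", "/speech-to-text/job/v1"),
}


def endpoint_for(transport: str, translate: bool = False) -> tuple[str, str]:
    """Return (method, path) for a transport, optionally the translate variant."""
    method, path = ENDPOINTS[transport]
    if translate:
        # Assumption: the translate endpoint uses the same method as its
        # transcription counterpart; only the path prefix changes.
        path = path.replace("/speech-to-text", "/speech-to-text-translate", 1)
    return method, path
```

With this mapping, `endpoint_for("batch", translate=True)` yields `("POST", "/speech-to-text-translate/job/v1")`, so switching to English output changes the path but not the choice of transport.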