# Which Speech-to-Text API to Use

> Compare Sarvam's Speech-to-Text APIs — REST, WebSocket, and Batch (plus their Speech-to-Text-Translate variants) — and pick the right one for your audio length, latency, and feature needs.

Sarvam gives you three ways to run speech recognition on the same models: the **REST**, **WebSocket**, and **Batch** APIs. They differ in how you send audio, how fast you get results, the maximum audio they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.

Every transport has a **Speech-to-Text-Translate** counterpart that returns **English** instead of the source language. The transport trade-offs below are identical — only the output language changes. See [Speech-to-Text-Translate](/api-reference-docs/speech-to-text-translate/translate).

## Quick decision

* **REST:** a short clip (≤30 seconds) whose transcript you want back in one call.
* **WebSocket:** live microphone or call audio that needs results as the user speaks.
* **Batch:** long recordings (up to 1 hour per file) that need diarization and timestamps.

## Comparison

|                              | REST                                                                                                                     | WebSocket                                                           | Batch                                                                                     |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Endpoint**                 | `POST /speech-to-text`                                                                                                   | `GET /speech-to-text/ws`                                            | `POST /speech-to-text/job/v1` (job flow)                                                  |
| **Processing**               | Synchronous                                                                                                              | Real-time streaming                                                 | Asynchronous (job)                                                                        |
| **Max audio length**         | 30 seconds                                                                                                               | Continuous (chunked)                                                | 1 hour per file                                                                           |
| **Files per request**        | 1                                                                                                                        | 1 stream                                                            | Up to 20 per job                                                                          |
| **Results**                  | Final transcript in the response                                                                                         | Final transcript per utterance (on VAD end-of-speech or `flush()`)  | Final transcript, downloaded when the job completes                                       |
| **Latency**                  | One round-trip after upload                                                                                              | Lowest — results arrive while audio is still streaming              | Highest — minutes, depending on queue and duration                                        |
| **Speaker diarization**      | No                                                                                                                       | No                                                                  | Yes                                                                                       |
| **Timestamps**               | No                                                                                                                       | No                                                                  | Yes (chunk-level)                                                                         |
| **Audio formats**            | All [supported formats](/api-reference-docs/api-guides-tutorials/speech-to-text/overview) (auto-detected; PCM at 16 kHz) | **WAV and raw PCM only** (`wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`) | All [supported formats](/api-reference-docs/api-guides-tutorials/speech-to-text/overview) |
| **Output modes** (Saaras v3) | `transcribe`, `translate`, `verbatim`, `translit`, `codemix`                                                             | `transcribe`, `translate`, `verbatim`, `translit`, `codemix`        | `transcribe`, `translate`, `verbatim`, `translit`, `codemix`                              |
| **Best for**                 | Short clips, voice commands, quick tests                                                                                 | Voice agents, live captions, call streaming                         | Meetings, interviews, call-center recordings, bulk pipelines                              |
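
The rules in the table can be condensed into a small decision helper. The following is a minimal sketch, not part of any Sarvam SDK: `AudioJob` and `choose_transport` are hypothetical names, and the only inputs it relies on are the limits stated in the table (30 seconds for REST, 1 hour per file for Batch, diarization and timestamps on Batch only).

```python
from dataclasses import dataclass

REST_MAX_SECONDS = 30          # REST limit from the comparison table
BATCH_MAX_SECONDS = 60 * 60    # Batch limit: 1 hour per file


@dataclass
class AudioJob:
    """Hypothetical description of one transcription task."""
    duration_s: float = 0.0          # ignored for live streams
    live: bool = False               # audio is still arriving (mic, call)
    needs_diarization: bool = False
    needs_timestamps: bool = False


def choose_transport(job: AudioJob) -> str:
    """Return "rest", "websocket", or "batch" following the table's rules."""
    wants_batch_features = job.needs_diarization or job.needs_timestamps
    if job.live:
        if wants_batch_features:
            # Diarization and timestamps are listed as Batch-only.
            raise ValueError("record the stream first, then submit it to Batch")
        return "websocket"
    if job.duration_s > BATCH_MAX_SECONDS:
        raise ValueError("file exceeds 1 hour; split it before using Batch")
    if wants_batch_features or job.duration_s > REST_MAX_SECONDS:
        return "batch"
    return "rest"
```

For example, `choose_transport(AudioJob(duration_s=12))` picks REST, while the same clip with `needs_diarization=True` is routed to Batch because REST does not offer diarization.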

## When to use each

**REST — `POST /speech-to-text`**

* The audio is already captured and short (≤30 seconds).
* You want one request, one response — no connection to manage.
* Examples: voice search, push-to-talk commands, transcribing a short voice note.
* [REST API guide →](/api-reference-docs/api-guides-tutorials/speech-to-text/rest-api)

**WebSocket — `GET /speech-to-text/ws`**

* Audio arrives continuously from a mic, browser, or telephony stream.
* You need transcripts as the user speaks (live captions, barge-in for voice agents).
* Note: only **WAV / raw PCM** is accepted, and results are **final per utterance** — there are no interim `is_final` partials. See [finalization semantics](/api-reference-docs/api-guides-tutorials/speech-to-text/streaming-api).
* [Streaming API guide →](/api-reference-docs/api-guides-tutorials/speech-to-text/streaming-api)
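
Because the WebSocket API accepts only WAV and raw PCM, it can help to reject other inputs before opening a connection. The sketch below is illustrative only: `check_websocket_codec` is a hypothetical helper, and the set it checks against is simply the four identifiers listed in the comparison table.

```python
# Codec identifiers listed for the WebSocket API in the comparison table.
WEBSOCKET_CODECS = {"wav", "pcm_s16le", "pcm_l16", "pcm_raw"}


def check_websocket_codec(codec: str) -> str:
    """Return the normalized codec name, or raise if the audio must be
    transcoded to WAV / raw PCM before it can be streamed."""
    normalized = codec.strip().lower()
    if normalized not in WEBSOCKET_CODECS:
        allowed = ", ".join(sorted(WEBSOCKET_CODECS))
        raise ValueError(
            f"{codec!r} is not accepted over WebSocket; use one of: {allowed}"
        )
    return normalized
```

Compressed sources such as MP3 would fail this check and need transcoding first, or they can be sent to the REST or Batch API instead, which accept all supported formats.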

**Batch — `POST /speech-to-text/job/v1`**

* Recordings are long (up to 1 hour) or you have many files (up to 20 per job).
* You need **speaker diarization** and **chunk-level timestamps** (e.g. subtitles, meeting minutes).
* Latency isn't critical — you submit a job and download results when it finishes.
* [Batch API guide →](/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api)
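
The two Batch limits above (1 hour per file, 20 files per job) suggest a simple pre-submission step for bulk pipelines. This is a minimal sketch under those stated limits; `plan_batch_jobs` is a hypothetical helper, and it assumes you already know each file's duration in seconds. It does not call the API.

```python
from typing import Iterable

MAX_FILES_PER_JOB = 20           # Batch limit: up to 20 files per job
MAX_SECONDS_PER_FILE = 60 * 60   # Batch limit: 1 hour per file


def plan_batch_jobs(
    files: Iterable[tuple[str, float]],
) -> tuple[list[list[str]], list[str]]:
    """Group (name, duration_s) pairs into jobs of at most 20 files.

    Returns (jobs, too_long), where too_long lists files over 1 hour
    that must be split before they can be submitted.
    """
    accepted: list[str] = []
    too_long: list[str] = []
    for name, duration_s in files:
        if duration_s > MAX_SECONDS_PER_FILE:
            too_long.append(name)
        else:
            accepted.append(name)
    jobs = [
        accepted[i:i + MAX_FILES_PER_JOB]
        for i in range(0, len(accepted), MAX_FILES_PER_JOB)
    ]
    return jobs, too_long
```

With 45 ten-minute recordings, this yields three jobs of 20, 20, and 5 files; a 70-minute recording would be returned in `too_long` rather than placed in a job.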

Need English output regardless of the spoken language? Use the same transport on **Speech-to-Text-Translate** (`/speech-to-text-translate`, `/speech-to-text-translate/ws`, `/speech-to-text-translate/job/v1`) with `mode="translate"` on Saaras v3.
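
Since the translate endpoints mirror the transcription paths one-to-one, switching between them can be reduced to a path lookup. The helper below is a hypothetical sketch that uses only the paths named on this page; the base URL, authentication, and request bodies are deliberately left out.

```python
# Paths as listed on this page; the host and auth headers are not shown here.
STT_PATHS = {
    "rest": "/speech-to-text",
    "websocket": "/speech-to-text/ws",
    "batch": "/speech-to-text/job/v1",
}


def endpoint_path(transport: str, translate: bool = False) -> str:
    """Return the path for a transport, optionally its translate variant."""
    path = STT_PATHS[transport]
    if translate:
        # Only the leading segment changes; the transport suffix is kept.
        path = path.replace("/speech-to-text", "/speech-to-text-translate", 1)
    return path
```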

## Related

* [Speech-to-Text overview](/api-reference-docs/api-guides-tutorials/speech-to-text/overview)
* [Supported audio formats & MIME types](/api-reference-docs/api-guides-tutorials/speech-to-text/overview)
* [Credits & Rate Limits](/api-reference-docs/ratelimits)