Which Speech-to-Text API to Use

Sarvam gives you three ways to run speech recognition on the same models: the REST, WebSocket, and Batch APIs. They differ in how you send audio, how fast you get results, the maximum audio they accept, and which features (diarization, timestamps) are available. Use this page to pick one before you start integrating.

Every transport has a Speech-to-Text-Translate counterpart that returns English instead of the source language. The transport trade-offs below are identical — only the output language changes. See Speech-to-Text-Translate.

Quick decision

  • REST: you have a short clip (≤30s) and want the transcript back in one call.
  • WebSocket: you have live microphone or call audio and need results as the user speaks.
  • Batch: you have long recordings (up to 1 hour per file), or audio that needs diarization and timestamps.

Comparison

| | REST | WebSocket | Batch |
|---|---|---|---|
| Endpoint | POST /speech-to-text | GET /speech-to-text/ws | POST /speech-to-text/job/v1 (job flow) |
| Processing | Synchronous | Real-time streaming | Asynchronous (job) |
| Max audio length | 30 seconds | Continuous (chunked) | 1 hour per file |
| Files per request | 1 | 1 stream | Up to 20 per job |
| Results | Final transcript in the response | Final transcript per utterance (on VAD end-of-speech or flush()) | Final transcript, downloaded when the job completes |
| Latency | One round-trip after upload | Lowest: results arrive while audio is still streaming | Highest: minutes, depending on queue and duration |
| Speaker diarization | No | No | Yes |
| Timestamps | No | No | Yes (chunk-level) |
| Audio formats | All supported formats (auto-detected; PCM at 16 kHz) | WAV and raw PCM only (wav, pcm_s16le, pcm_l16, pcm_raw) | All supported formats |
| Output modes (Saaras v3) | transcribe, translate, verbatim, translit, codemix | transcribe, translate, verbatim, translit, codemix | transcribe, translate, verbatim, translit, codemix |
| Best for | Short clips, voice commands, quick tests | Voice agents, live captions, call streaming | Meetings, interviews, call-center recordings, bulk pipelines |
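The limits in the comparison can be turned into a small decision helper. The following is a minimal sketch, not part of any Sarvam SDK: the `Workload` class and `choose_transport` function are hypothetical names, and only the numeric limits (30 seconds, 1 hour) and the Batch-only features come from the comparison above.

```python
from dataclasses import dataclass

REST_MAX_SECONDS = 30        # REST limit from the comparison
BATCH_MAX_SECONDS = 60 * 60  # Batch limit: 1 hour per file


@dataclass
class Workload:
    """Hypothetical description of a transcription workload."""
    live: bool = False               # audio is still being captured
    duration_s: float = 0.0          # length of the file (ignored when live)
    needs_diarization: bool = False
    needs_timestamps: bool = False


def choose_transport(w: Workload) -> str:
    """Return 'websocket', 'rest', or 'batch' using the documented limits."""
    if w.live:
        # Diarization and timestamps are listed as Batch-only features.
        if w.needs_diarization or w.needs_timestamps:
            raise ValueError("diarization and timestamps are Batch-only")
        return "websocket"
    if w.duration_s > BATCH_MAX_SECONDS:
        raise ValueError("file exceeds the 1-hour Batch limit; split it first")
    needs_batch = (
        w.needs_diarization
        or w.needs_timestamps
        or w.duration_s > REST_MAX_SECONDS
    )
    return "batch" if needs_batch else "rest"
```

For example, `choose_transport(Workload(duration_s=12))` selects REST, while the same clip with `needs_diarization=True` is routed to Batch.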

When to use each

REST — POST /speech-to-text

  • The audio is already captured and short (≤30 seconds).
  • You want one request, one response — no connection to manage.
  • Examples: voice search, push-to-talk commands, transcribing a short voice note.
  • REST API guide →
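Because REST accepts at most 30 seconds of audio, it can help to check a clip's length locally before uploading. The sketch below uses only Python's standard `wave` module, so it handles PCM WAV files only. The helper names are hypothetical, and `make_silent_wav` exists purely to produce demo input.

```python
import io
import wave

REST_MAX_SECONDS = 30  # limit stated for POST /speech-to-text


def wav_duration_seconds(data: bytes) -> float:
    """Duration of an in-memory PCM WAV file, computed from its header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_rest(data: bytes) -> bool:
    """True when the clip is within the REST length limit."""
    return wav_duration_seconds(data) <= REST_MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

A clip that fails `fits_rest` is a candidate for the Batch API, or for splitting into shorter segments first.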

WebSocket — GET /speech-to-text/ws

  • Audio arrives continuously from a mic, browser, or telephony stream.
  • You need transcripts as the user speaks (live captions, barge-in for voice agents).
  • Note: only WAV / raw PCM is accepted, and every result is final for its utterance; there are no interim partial results to distinguish with an is_final flag. See finalization semantics.
  • Streaming API guide →
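Streaming audio continuously usually means slicing a raw PCM buffer into short, fixed-duration chunks before sending each one over the socket. The sketch below shows only that slicing step. It assumes 16 kHz mono `pcm_s16le` audio and a 100 ms chunk size, neither of which is specified on this page for the WebSocket endpoint. The message framing the endpoint expects is not covered here; see the Streaming API guide for that.

```python
from typing import Iterator

SAMPLE_RATE = 16000   # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2  # pcm_s16le = signed 16-bit little-endian samples


def pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM, aligned to whole samples."""
    chunk_bytes = SAMPLE_RATE * chunk_ms // 1000 * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        # The last slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk would then be sent as it becomes available, so transcripts can arrive while the user is still speaking.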

Batch — POST /speech-to-text/job/v1

  • Recordings are long (up to 1 hour) or you have many files (up to 20 per job).
  • You need speaker diarization and chunk-level timestamps (e.g. subtitles, meeting minutes).
  • Latency isn’t critical — you submit a job and download results when it finishes.
  • Batch API guide →
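With more than 20 recordings, the files have to be spread across several jobs. The sketch below plans that grouping and rejects any file over the 1-hour limit. `plan_jobs` is a hypothetical helper, and it assumes the caller already knows each file's duration in seconds. Submitting the jobs themselves is described in the Batch API guide.

```python
MAX_FILES_PER_JOB = 20           # "up to 20 per job"
MAX_SECONDS_PER_FILE = 60 * 60   # "up to 1 hour" per file


def plan_jobs(durations: dict[str, float]) -> list[list[str]]:
    """Group file names into jobs of at most 20; reject files over 1 hour."""
    too_long = [
        name for name, secs in durations.items()
        if secs > MAX_SECONDS_PER_FILE
    ]
    if too_long:
        raise ValueError(f"files exceed the 1-hour limit: {too_long}")
    names = list(durations)  # preserves insertion order
    return [
        names[i:i + MAX_FILES_PER_JOB]
        for i in range(0, len(names), MAX_FILES_PER_JOB)
    ]
```

Under these assumptions, 45 ten-minute recordings would be planned as three jobs of 20, 20, and 5 files.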

Need English output regardless of the spoken language? Use the same transport on Speech-to-Text-Translate (/speech-to-text-translate, /speech-to-text-translate/ws, /speech-to-text-translate/job/v1) with mode="translate" on Saaras v3.
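The endpoint pairs above can be kept in a small lookup so the rest of an integration only chooses a transport and whether it wants English output. This is a sketch: the paths are the ones listed on this page, while `endpoint_for` and the transport keys are hypothetical names, and the base URL is deliberately left out.

```python
ENDPOINTS = {
    "rest": "/speech-to-text",
    "websocket": "/speech-to-text/ws",
    "batch": "/speech-to-text/job/v1",
}

TRANSLATE_ENDPOINTS = {
    "rest": "/speech-to-text-translate",
    "websocket": "/speech-to-text-translate/ws",
    "batch": "/speech-to-text-translate/job/v1",
}


def endpoint_for(transport: str, english_output: bool = False) -> str:
    """Return the path for a transport, optionally its translate counterpart."""
    table = TRANSLATE_ENDPOINTS if english_output else ENDPOINTS
    try:
        return table[transport]
    except KeyError:
        raise ValueError(f"unknown transport: {transport!r}") from None
```

For example, `endpoint_for("batch", english_output=True)` yields `/speech-to-text-translate/job/v1`.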

Related

  • Speech-to-Text overview
  • Supported audio formats & MIME types
  • Credits & Rate Limits