WebSocket | Sarvam API Docs

WebSocket channel for real-time speech to text streaming.

Note: This API Reference page is provided for informational purposes only. The Try It playground may not provide the best experience for streaming audio. For optimal streaming performance, please use the SDK or implement your own WebSocket client.

Handshake

WSS

wss://api.sarvam.ai/speech-to-text/ws

Headers

Api-Subscription-Key (string, required)

API subscription key for authentication
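
As a minimal sketch of the handshake inputs described above, the helper below pairs the documented endpoint with the documented header name. The function name `handshake_headers` and the empty-key check are illustrative assumptions, not part of the API.

```python
# Endpoint and header name are taken from this page; everything else is illustrative.
BASE_URL = "wss://api.sarvam.ai/speech-to-text/ws"


def handshake_headers(api_key: str) -> dict[str, str]:
    """Return the headers to send with the WebSocket handshake request."""
    if not api_key:
        # Client-side guard (an assumption): fail early instead of at the server.
        raise ValueError("an API subscription key is required")
    return {"Api-Subscription-Key": api_key}


if __name__ == "__main__":
    print(BASE_URL)
    print(handshake_headers("your-subscription-key"))
```

Most WebSocket client libraries accept a dictionary like this as extra handshake headers; the exact argument name depends on the library.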

Query parameters

language-code (enum, required)

Specifies the language of the input audio in BCP-47 format.

Available Options (saarika:v2.5, legacy):

unknown (default): Use when the language is not known; the API will auto-detect.
hi-IN: Hindi
bn-IN: Bengali
gu-IN: Gujarati
kn-IN: Kannada
ml-IN: Malayalam
mr-IN: Marathi
od-IN: Odia
pa-IN: Punjabi
ta-IN: Tamil
te-IN: Telugu
en-IN: English

Additional Options (saaras:v3, recommended):

as-IN: Assamese
ur-IN: Urdu
ne-IN: Nepali
kok-IN: Konkani
ks-IN: Kashmiri
sd-IN: Sindhi
sa-IN: Sanskrit
sat-IN: Santali
mni-IN: Manipuri
brx-IN: Bodo
mai-IN: Maithili
doi-IN: Dogri
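
The two lists above can be checked on the client before connecting. The sketch below assumes that saaras:v3 accepts the legacy codes plus the additional ones (reading "Additional Options" as an extension of the first list) and that saarika:v2.5 accepts only the first list. The helper names `supported_codes` and `build_url` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.sarvam.ai/speech-to-text/ws"

# Codes listed for saarika:v2.5 (legacy).
LEGACY_CODES = {
    "unknown", "hi-IN", "bn-IN", "gu-IN", "kn-IN", "ml-IN", "mr-IN",
    "od-IN", "pa-IN", "ta-IN", "te-IN", "en-IN",
}
# Codes listed as additional options for saaras:v3.
V3_EXTRA_CODES = {
    "as-IN", "ur-IN", "ne-IN", "kok-IN", "ks-IN", "sd-IN", "sa-IN",
    "sat-IN", "mni-IN", "brx-IN", "mai-IN", "doi-IN",
}


def supported_codes(model: str) -> set[str]:
    """Return the language codes assumed to be valid for the given model."""
    if model == "saaras:v3":
        return LEGACY_CODES | V3_EXTRA_CODES  # assumption: v3 is a superset
    if model == "saarika:v2.5":
        return set(LEGACY_CODES)
    raise ValueError(f"unknown model: {model}")


def build_url(language_code: str, model: str = "saaras:v3") -> str:
    """Build the connection URL with the language-code and model query parameters."""
    if language_code not in supported_codes(model):
        raise ValueError(f"{language_code} is not listed for {model}")
    query = urlencode({"language-code": language_code, "model": model})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(build_url("hi-IN"))
```

Note that `urlencode` percent-encodes the colon in the model name, which a standards-compliant server decodes back to `saaras:v3`.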

model (enum, optional, defaults to saaras:v3)

Specifies the model to use for speech-to-text conversion.

saaras:v3 (default, recommended): State-of-the-art model with flexible output formats. Supports multiple modes via the mode parameter: transcribe, translate, verbatim, translit, codemix.
saarika:v2.5 (legacy): Transcribes audio in the spoken language. Kept for backward compatibility.

Allowed values: saaras:v3, saarika:v2.5

mode (enum, optional, defaults to transcribe)

Mode of operation. Only applicable when using saaras:v3 model.

Example audio: ‘मेरा फोन नंबर है 9840950950’

transcribe (default): Standard transcription in the original language with proper formatting and number normalization.
- Output: मेरा फोन नंबर है 9840950950
translate: Translates speech from any supported Indic language to English.
- Output: My phone number is 9840950950
verbatim: Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is.
- Output: मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero
translit: Romanization - Transliterates speech to Latin/Roman script only.
- Output: mera phone number hai 9840950950
codemix: Code-mixed text with English words in English and Indic words in native script.
- Output: मेरा phone number है 9840950950

Allowed values: transcribe, translate, verbatim, translit, codemix

sample_rate (enum, optional)

Audio sample rate for the WebSocket connection. As a connection parameter, only 16 kHz and 8 kHz are supported, and 8 kHz is available only through this parameter. Defaults to 16 kHz when not specified.
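
The sample rate determines how many bytes each audio chunk occupies. The arithmetic below is a sketch that assumes 16-bit mono PCM (2 bytes per sample, 1 channel) and an arbitrary 100 ms chunk duration; neither the chunk duration nor the function name `chunk_bytes` comes from the API.

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Return the size in bytes of one audio chunk of chunk_ms milliseconds."""
    samples = sample_rate_hz * chunk_ms // 1000  # samples per channel in the chunk
    return samples * bytes_per_sample * channels


if __name__ == "__main__":
    # 16 kHz: 1600 samples per 100 ms, 2 bytes each.
    print(chunk_bytes(16000, 100))  # 3200
    # 8 kHz: 800 samples per 100 ms, 2 bytes each.
    print(chunk_bytes(8000, 100))   # 1600
```

Halving the sample rate halves the bandwidth for the same chunk duration, which is the usual reason to pick 8 kHz for telephony audio.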

high_vad_sensitivity (enum, optional)

Enable high VAD (Voice Activity Detection) sensitivity

vad_signals (enum, optional)

Enable VAD signals in response

flush_signal (enum, optional)

Signal to flush the audio buffer and finalize transcription

input_audio_codec (enum, optional)

Audio codec/format of the input stream. Use this when sending raw PCM audio. Supported values: wav, pcm_s16le, pcm_l16, pcm_raw.
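
For raw PCM input, samples have to be serialized in the byte layout the codec name implies. The sketch below assumes `pcm_s16le` follows the common FFmpeg naming, that is, signed 16-bit little-endian samples; this page does not define the layout, and the helper name `floats_to_pcm_s16le` is hypothetical.

```python
import struct


def floats_to_pcm_s16le(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes."""
    # Clip out-of-range values so they cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    print(floats_to_pcm_s16le([0.0, 1.0, -1.0]).hex())  # 0000ff7f0180
```

Each sample occupies two bytes, so the output length is always twice the number of input samples.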

Send

Audio Transcription Message (object, required)

Send audio data for real-time speech to text streaming

Speech Flush Signal (object, required)

Send signal to flush audio buffer and finalize transcription

Receive

Transcription (object, required)

Receive real-time transcription results from the WebSocket

URL	wss://api.sarvam.ai/speech-to-text/ws
Method	GET
Status	101 Switching Protocols
