WebSocket channel for real-time speech to text streaming with English translation.
Note: This API Reference page is provided for informational purposes only.
The Try It playground may not provide the best experience for streaming audio.
For optimal streaming performance, please use the SDK or implement your own WebSocket client.
Handshake
WSS
wss://api.sarvam.ai/speech-to-text-translate/ws
Headers
`Api-Subscription-Key` (string, required)
API subscription key for authentication
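A minimal connection sketch in Python, assuming the third-party `websockets` package; the URL and `Api-Subscription-Key` header are taken from this reference, while the receive loop is an illustrative assumption about the message flow.

```python
# Minimal handshake sketch. Requires the third-party `websockets` package
# (pip install websockets); it is imported inside the coroutine so the
# sketch can be loaded and read without it installed.
import asyncio
import json

API_URL = "wss://api.sarvam.ai/speech-to-text-translate/ws"

async def listen(api_key: str) -> None:
    import websockets  # pip install websockets
    headers = {"Api-Subscription-Key": api_key}
    # keyword is `additional_headers` in websockets >= 14 (older releases: `extra_headers`)
    async with websockets.connect(API_URL, additional_headers=headers) as ws:
        async for raw in ws:  # each frame is expected to be a JSON text message
            print(json.loads(raw))
```

Run with `asyncio.run(listen(my_key))`; the server streams transcription and translation results back over the same socket.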
Query parameters
`model` (enum, optional; defaults to saaras:v3)
Model to be used for speech to text translation.
- **saaras:v3** (default, recommended): State-of-the-art translation model that translates audio from any spoken Indic language to English with flexible output formats via the `mode` parameter.
- **saaras:v2.5** (legacy): Translation model that translates audio from any spoken Indic language to English. Kept for backward compatibility.
- Example: Hindi audio → English text output
Allowed values: `saaras:v3`, `saaras:v2.5`
`mode` (enum, optional; defaults to translate)
Mode of operation. **Only applicable when using saaras:v3 model.**
- **translate** (default): Translates speech from any supported Indic language to English.
- Example: Hindi audio → English text output
- **transcribe**: Standard transcription in the original language.
- **verbatim**: Exact word-for-word transcription without normalization.
- **translit**: Romanization; transliterates speech to Latin/Roman script only.
- **codemix**: Code-mixed text with English words in English and Indic words in native script.
Allowed values: `translate`, `transcribe`, `verbatim`, `translit`, `codemix`
`sample_rate` (enum, optional)
Audio sample rate for the WebSocket connection. Only 16kHz and 8kHz are supported as connection parameters, and 8kHz is available only through this parameter. Defaults to 16kHz if not specified.
Allowed values: 16kHz, 8kHz
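The query parameters above are passed on the handshake URL. A small sketch of building that URL; the parameter names and defaults come from this reference, but the `sample_rate` literal (16000) is an assumption about the wire format.

```python
# Build the handshake URL with query parameters. Names and defaults are from
# this reference; the numeric sample_rate value is an illustrative assumption.
from urllib.parse import urlencode

BASE_URL = "wss://api.sarvam.ai/speech-to-text-translate/ws"

def build_url(model: str = "saaras:v3", mode: str = "translate",
              sample_rate: int = 16000) -> str:
    query = urlencode({"model": model, "mode": mode, "sample_rate": sample_rate})
    return f"{BASE_URL}?{query}"
```

Note that `urlencode` percent-encodes the colon in the model name (`saaras%3Av3`), which is the expected form for a query string.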
`high_vad_sensitivity` (enum, optional)
Enable high VAD (Voice Activity Detection) sensitivity
`positive_speech_threshold` (string, optional)
VAD probability threshold (0.0–1.0) above which a frame is considered speech.
Overrides the server default when provided.
`negative_speech_threshold` (string, optional)
VAD probability threshold (0.0–1.0) below which a frame is considered silence.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`min_speech_frames` (string, optional)
Minimum number of consecutive speech frames required to start a speech segment.
Overrides the server default when provided.
`first_turn_min_speech_frames` (string, optional)
Minimum speech frames required specifically for the first user turn.
Overrides the server default when provided.
`negative_frames_count` (string, optional)
Number of negative (silence) frames needed within the window to end a speech segment.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`negative_frames_window` (string, optional)
Sliding window size (in frames) over which negative frames are counted.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`start_speech_volume_threshold` (string, optional)
Volume level (dB) below which audio is considered too quiet to be speech.
When not provided, no volume-based filtering is applied.
`interrupt_min_speech_frames` (string, optional)
Minimum speech frames required to register a barge-in / interruption.
Overrides the server default when provided.
`pre_speech_pad_frames` (string, optional)
Number of audio frames to prepend before the detected speech onset,
ensuring the beginning of speech is not clipped.
Overrides the server default when provided.
`num_initial_ignored_frames` (string, optional)
Number of leading audio frames to skip entirely at connection start.
Useful for discarding connection setup noise.
Overrides the server default when provided.
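The VAD tuning parameters above are likewise set on the handshake URL. A sketch of composing them; all values are passed as strings per this reference, and the specific numbers are illustrative, not recommended defaults.

```python
# VAD tuning overrides appended to the handshake URL. All values are strings
# per this reference; the numbers below are illustrative only.
from urllib.parse import urlencode

vad_overrides = {
    "positive_speech_threshold": "0.6",   # frames above this probability count as speech
    "negative_speech_threshold": "0.35",  # frames below this probability count as silence
    "min_speech_frames": "3",             # consecutive frames needed to open a segment
    "pre_speech_pad_frames": "2",         # frames prepended so speech onsets are not clipped
}
url = ("wss://api.sarvam.ai/speech-to-text-translate/ws?"
       + urlencode(vad_overrides))
```

Omitting any of these keys leaves the server default (or the `high_vad_sensitivity` preset) in effect.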
`vad_signals` (enum, optional)
Enable VAD signals in the response
`flush_signal` (enum, optional)
Signal to flush the audio buffer and finalize transcription and translation
`input_audio_codec` (enum, optional)
Audio codec/format of the input stream. Use this when sending raw PCM audio.
Allowed values: `wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`
Send
Audio Translation Message (object, required)
Send audio data for real-time speech to text streaming with translation
OR
Translation Config Message (object, required)
Send configuration for speech to text streaming with translation
OR
Speech Translate Flush Signal (object, required)
Send signal to flush audio buffer and finalize transcription and translation
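Hedged sketches of the three client-to-server messages above. The envelope field names (`audio`, `data`, `encoding`, `config`, `signal`) are assumptions for illustration only; consult the linked message schemas for the authoritative shapes.

```python
# Illustrative builders for the three Send message types. Field names are
# assumed, not confirmed by this page.
import base64
import json

def audio_message(chunk: bytes, codec: str = "audio/wav") -> str:
    # Audio Translation Message: audio bytes are base64-encoded into a JSON frame
    return json.dumps({"audio": {"data": base64.b64encode(chunk).decode("ascii"),
                                 "encoding": codec}})

def config_message(**options: str) -> str:
    # Translation Config Message: runtime configuration for the stream
    return json.dumps({"config": options})

def flush_message() -> str:
    # Speech Translate Flush Signal: finalize pending transcription/translation
    return json.dumps({"signal": "flush"})
```

Each builder returns a JSON text frame suitable for `ws.send(...)` on an open connection.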
Receive
Translation (object, required)
Receive real-time transcription and translation results from the WebSocket