WebSocket channel for real-time speech to text streaming with English translation.
Note: This API Reference page is provided for informational purposes only.
The Try It playground may not provide the best experience for streaming audio.
For optimal streaming performance, please use the SDK or implement your own WebSocket client.
Handshake
WSS
wss://api.sarvam.ai/speech-to-text-translate/ws
Headers
`Api-Subscription-Key` (string, required)
API subscription key for authentication
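A minimal connection sketch in Python, assuming the third-party `websockets` package; the URL and `Api-Subscription-Key` header are taken from this reference, while the receive loop is an illustrative assumption about the message flow.

```python
# Minimal handshake sketch. Requires the third-party `websockets` package
# (pip install websockets); it is imported inside the coroutine so the
# sketch can be loaded and read without it installed.
import asyncio
import json

API_URL = "wss://api.sarvam.ai/speech-to-text-translate/ws"

async def listen(api_key: str) -> None:
    import websockets  # pip install websockets
    headers = {"Api-Subscription-Key": api_key}
    # keyword is `additional_headers` in websockets >= 14 (older releases: `extra_headers`)
    async with websockets.connect(API_URL, additional_headers=headers) as ws:
        async for raw in ws:  # each frame is expected to be a JSON text message
            print(json.loads(raw))
```

Run with `asyncio.run(listen(my_key))`; the server streams transcription and translation results back over the same socket.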
Query parameters
`model` (enum, optional; defaults to saaras:v3)
Model to be used for speech to text translation.
- **saaras:v3** (default, recommended): State-of-the-art translation model that translates audio from any spoken Indic language to English with flexible output formats via the `mode` parameter.
- **saaras:v2.5** (legacy): Translation model that translates audio from any spoken Indic language to English. Kept for backward compatibility.
- Example: Hindi audio → English text output
Allowed values: `saaras:v3`, `saaras:v2.5`
`mode` (enum, optional; defaults to translate)
Mode of operation. **Only applicable when using saaras:v3 model.**
- **translate** (default): Translates speech from any supported Indic language to English.
- Example: Hindi audio → English text output
- **transcribe**: Standard transcription in the original language.
- **verbatim**: Exact word-for-word transcription without normalization.
- **translit**: Romanization; transliterates speech to Latin/Roman script only.
- **codemix**: Code-mixed text with English words in English and Indic words in native script.
Allowed values: `translate`, `transcribe`, `verbatim`, `translit`, `codemix`
`sample_rate` (enum, optional)
Audio sample rate for the WebSocket connection. Only 16kHz and 8kHz are supported as connection parameters, and 8kHz is available only through this parameter. Defaults to 16kHz if not specified.
Allowed values: 16kHz, 8kHz
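The query parameters above are passed on the handshake URL. A small sketch of building that URL; the parameter names and defaults come from this reference, but the `sample_rate` literal (16000) is an assumption about the wire format.

```python
# Build the handshake URL with query parameters. Names and defaults are from
# this reference; the numeric sample_rate value is an illustrative assumption.
from urllib.parse import urlencode

BASE_URL = "wss://api.sarvam.ai/speech-to-text-translate/ws"

def build_url(model: str = "saaras:v3", mode: str = "translate",
              sample_rate: int = 16000) -> str:
    query = urlencode({"model": model, "mode": mode, "sample_rate": sample_rate})
    return f"{BASE_URL}?{query}"
```

Note that `urlencode` percent-encodes the colon in the model name (`saaras%3Av3`), which is the expected form for a query string.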
`high_vad_sensitivity` (enum, optional)
Enable high VAD (Voice Activity Detection) sensitivity
`positive_speech_threshold` (string, optional)
VAD probability threshold (0.0–1.0) above which a frame is considered speech.
Overrides the server default when provided.
`negative_speech_threshold` (string, optional)
VAD probability threshold (0.0–1.0) below which a frame is considered silence.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`min_speech_frames` (string, optional)
Minimum number of consecutive speech frames required to start a speech segment.
Overrides the server default when provided.
`first_turn_min_speech_frames` (string, optional)
Minimum speech frames required specifically for the first user turn.
Overrides the server default when provided.
`negative_frames_count` (string, optional)
Number of negative (silence) frames needed within the window to end a speech segment.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`negative_frames_window` (string, optional)
Sliding window size (in frames) over which negative frames are counted.
Overrides the server default (or the high_vad_sensitivity preset) when provided.
`start_speech_volume_threshold` (string, optional)
Volume level (dB) below which audio is considered too quiet to be speech.
When not provided, no volume-based filtering is applied.
`interrupt_min_speech_frames` (string, optional)
Minimum speech frames required to register a barge-in / interruption.
Overrides the server default when provided.
`pre_speech_pad_frames` (string, optional)
Number of audio frames to prepend before the detected speech onset,
ensuring the beginning of speech is not clipped.
Overrides the server default when provided.
`num_initial_ignored_frames` (string, optional)
Number of leading audio frames to skip entirely at connection start.
Useful for discarding connection setup noise.
Overrides the server default when provided.
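The VAD tuning parameters above are likewise set on the handshake URL. A sketch of composing them; all values are passed as strings per this reference, and the specific numbers are illustrative, not recommended defaults.

```python
# VAD tuning overrides appended to the handshake URL. All values are strings
# per this reference; the numbers below are illustrative only.
from urllib.parse import urlencode

vad_overrides = {
    "positive_speech_threshold": "0.6",   # frames above this probability count as speech
    "negative_speech_threshold": "0.35",  # frames below this probability count as silence
    "min_speech_frames": "3",             # consecutive frames needed to open a segment
    "pre_speech_pad_frames": "2",         # frames prepended so speech onsets are not clipped
}
url = ("wss://api.sarvam.ai/speech-to-text-translate/ws?"
       + urlencode(vad_overrides))
```

Omitting any of these keys leaves the server default (or the `high_vad_sensitivity` preset) in effect.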
`vad_signals` (enum, optional)
Enable VAD signals in the response
`flush_signal` (enum, optional)
Signal to flush the audio buffer and finalize transcription and translation
`input_audio_codec` (enum, optional)
Audio codec/format of the input stream. Use this when sending raw PCM audio.
Allowed values: `wav`, `pcm_s16le`, `pcm_l16`, `pcm_raw`
Send
Audio Translation Message (object, required)
Send audio data for real-time speech to text streaming with translation
OR
Translation Config Message (object, required)
Send configuration for speech to text streaming with translation
OR
Speech Translate Flush Signal (object, required)
Send signal to flush audio buffer and finalize transcription and translation
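Hedged sketches of the three client-to-server messages above. The envelope field names (`audio`, `data`, `encoding`, `config`, `signal`) are assumptions for illustration only; consult the linked message schemas for the authoritative shapes.

```python
# Illustrative builders for the three Send message types. Field names are
# assumed, not confirmed by this page.
import base64
import json

def audio_message(chunk: bytes, codec: str = "audio/wav") -> str:
    # Audio Translation Message: audio bytes are base64-encoded into a JSON frame
    return json.dumps({"audio": {"data": base64.b64encode(chunk).decode("ascii"),
                                 "encoding": codec}})

def config_message(**options: str) -> str:
    # Translation Config Message: runtime configuration for the stream
    return json.dumps({"config": options})

def flush_message() -> str:
    # Speech Translate Flush Signal: finalize pending transcription/translation
    return json.dumps({"signal": "flush"})
```

Each builder returns a JSON text frame suitable for `ws.send(...)` on an open connection.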
Receive
Translation (object, required)
Receive real-time transcription and translation results from the WebSocket