Streaming Speech-to-Text API
Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that need immediate speech processing with minimal delay.
For complete API reference documentation, see the Speech-to-Text API Reference section.
Model Availability: The Streaming API supports Saaras v3 (recommended) with multiple output modes via the mode parameter. Legacy models Saarika v2.5 and Saaras v2.5 are also available but we recommend switching to Saaras v3 for the best accuracy and features.
Get transcription results in milliseconds, not seconds. Process speech as it happens with near-instantaneous responses.
Support for 10+ Indian languages plus English, with high-accuracy transcription and translation.
Smart Voice Activity Detection (VAD) with customizable sensitivity for optimal speech boundary detection.
Audio Format Support: The streaming APIs support only two audio formats:
WAV (wav)
Raw PCM (pcm_s16le, pcm_l16, pcm_raw)
Other formats such as MP3, AAC, and OGG are not supported for WebSocket streaming. Find sample audio files in our GitHub cookbook.
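Because only WAV and raw PCM are accepted, it can help to check a file locally before streaming it. The following is a minimal sketch using Python's standard `wave` module; the helper names `describe_wav` and `is_streamable_wav` are illustrative and not part of any SDK.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, channels, sample_width_bytes) for a WAV source.

    `source` is a file path or a binary file-like object. The stdlib
    `wave` module raises wave.Error when the data is not a WAV file.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth()


def is_streamable_wav(source, expected_rate):
    """True if `source` parses as WAV and its sample rate is `expected_rate`."""
    try:
        rate, _channels, _width = describe_wav(source)
    except (wave.Error, EOFError):
        return False  # not a WAV container (e.g. MP3, AAC, OGG) or truncated
    return rate == expected_rate
```

For example, `is_streamable_wav("call.wav", 8000)` returns `False` for an MP3 renamed to `.wav`, and also for a valid WAV recorded at a different rate than the one you plan to declare.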
Get up and running with streaming in minutes. Simply change the mode parameter to switch between transcription, translation, and other output formats.
Transcribe audio in the original language.
Here’s a complete working example. Change the mode parameter to switch between any of the supported modes:
Add smart voice activity detection for better accuracy and control:
Force immediate processing without waiting for silence detection:
Optimize for your specific audio setup (e.g., 8kHz telephony audio):
Important: Sample Rate Configuration for 8kHz Audio
When working with 8kHz audio, you must set the sample_rate parameter in both places: when opening the connection and in each transcribe/translate call.
Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.
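One way to avoid a mismatch is to derive both settings from a single value. The sketch below is an illustration only: the function name `build_stream_settings`, the `model` query key, and the mode value passed in the usage line are assumptions, so check the endpoint reference for the exact parameter names and allowed values.

```python
from urllib.parse import urlencode


def build_stream_settings(sample_rate, mode, model="saaras:v3"):
    """Derive connection and per-call settings from one sample rate.

    Returns (connection_query, call_kwargs). Both carry the same
    sample_rate, so the two places cannot drift apart.
    """
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        raise ValueError("sample_rate must be a positive integer (Hz)")
    # Query string for the WebSocket connection (key names are assumed).
    connection_query = urlencode(
        {"model": model, "mode": mode, "sample_rate": sample_rate}
    )
    # Keyword arguments to reuse on every transcribe/translate call.
    call_kwargs = {"sample_rate": sample_rate}
    return connection_query, call_kwargs
```

Usage: `query, kwargs = build_stream_settings(8000, "transcribe")`, then append `query` to the connection URL and pass `**kwargs` to each audio call.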
For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket
Long-lived sockets will occasionally drop (network blips, idle timeouts, server restarts). Inspect the WebSocket close code and reconnect with backoff.
Codes 1000–1015 are standard WebSocket codes. Any 4000–4999 code is application-specific — always read the accompanying close reason string rather than assuming a fixed meaning.
Reconnect with exponential backoff (pseudocode — applies to both SDKs):
Do not auto-retry on 4xxx auth/quota closes — fix the underlying issue first (see Errors & Troubleshooting).
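The reconnect guidance above can be sketched in plain Python without depending on either SDK. This is a hedged illustration: `connect` and `sleep` are caller-supplied callables, `connect` is assumed to run one session and return its close code, and treating every 4000-4999 code as non-retryable is a conservative assumption rather than documented behavior.

```python
import random

NORMAL_CLOSURE = 1000  # standard WebSocket "normal closure" code


def should_reconnect(close_code):
    """Decide whether an automatic reconnect is appropriate."""
    if close_code == NORMAL_CLOSURE:
        return False  # clean shutdown, nothing to recover
    if 4000 <= close_code <= 4999:
        return False  # application-specific: read the reason and fix it first
    return True  # e.g. 1006 abnormal closure, 1011 server error


def backoff_delays(base=0.5, factor=2.0, cap=30.0, max_attempts=6,
                   rng=random.random):
    """Yield sleep durations: exponential growth, capped, with full jitter."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield rng() * ceiling


def run_with_reconnect(connect, sleep, **backoff):
    """Run sessions until one ends with a non-retryable close code.

    Returns the last close code. The delay schedule is not reset after
    a long healthy session; a production loop would likely reset it.
    """
    delays = backoff_delays(**backoff)
    while True:
        code = connect()
        if not should_reconnect(code):
            return code
        delay = next(delays, None)
        if delay is None:
            return code  # retry budget exhausted
        sleep(delay)
```

In a real client, `sleep` would be `time.sleep` (or an awaitable equivalent), and the close reason string should be logged alongside the code before deciding what to do.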
In a voice agent, the user may start speaking while your TTS reply is still playing (“barge-in”). Use vad_signals=true and treat the START_SPEECH event as the cue to stop playback immediately and let the user take the turn.
For conversational use, prefer high_vad_sensitivity=True (0.5s silence boundary) so the agent reacts quickly. See the LiveKit and Pipecat voice-agent integration guides for full agent setups, and Credits & Rate Limits for concurrency limits on streaming connections.
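The barge-in rule can be expressed as a small state holder that sits between the VAD events and your audio player. This is a sketch under assumptions: `BargeInController` and its method names are hypothetical, `stop_playback` is whatever callable halts your TTS output, and the signal is assumed to arrive as the plain string `START_SPEECH`.

```python
class BargeInController:
    """Stops TTS playback when the user starts speaking over the agent."""

    def __init__(self, stop_playback):
        self._stop_playback = stop_playback  # callable that halts TTS audio
        self.agent_speaking = False

    def on_tts_started(self):
        self.agent_speaking = True

    def on_tts_finished(self):
        self.agent_speaking = False

    def on_vad_event(self, signal):
        """Handle one VAD signal; return True if playback was interrupted."""
        if signal == "START_SPEECH" and self.agent_speaking:
            self._stop_playback()
            self.agent_speaking = False  # the user now has the turn
            return True
        return False
```

Keeping the decision in one place makes it easy to add a short grace period later if brief background noises cause unwanted interruptions.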
Configure your WebSocket connection with these parameters:
When sending audio data to the streaming endpoint:
When vad_signals=true, you’ll receive different message types:
For STT:
speech_start: Voice activity detected
speech_end: Voice activity stopped
transcript: Final transcription result
For STTT:
speech_start: Voice activity detected
speech_end: Voice activity stopped
translation: Final translation result

sample_rate=8000 in both connection and transcribe/translate calls
high_vad_sensitivity=false
high_vad_sensitivity=true
Saaras v3 (saaras:v3) with mode parameter for the best transcription quality and flexible output modes
Saarika v2.5 (saarika:v2.5) for transcription in the original language (legacy)
Saaras v2.5 (saaras:v2.5) for direct translation to English (legacy)
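To handle the message types received when vad_signals=true, a small dispatcher keeps the receive loop readable. The sketch below assumes each message is a JSON object with a `type` field holding one of the names listed earlier; that message shape and the `route_message` helper are assumptions for illustration, not the documented wire format.

```python
import json


def route_message(raw, handlers):
    """Parse one JSON message and call the handler registered for its type.

    `handlers` maps a type name (e.g. "speech_start", "transcript") to a
    callable taking the decoded message. Unknown types return None so
    that new server-side message types do not crash the receive loop.
    """
    message = json.loads(raw)
    handler = handlers.get(message.get("type"))
    if handler is None:
        return None
    return handler(message)
```

For translation streams, register a handler under `translation` instead of `transcript`; the speech boundary handlers stay the same.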