Streaming Speech-to-Text API
Overview
Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that require immediate speech processing with minimal delay.
Key Benefits
- Low latency: Get transcription results in milliseconds, not seconds, and process speech as it happens with near-instantaneous responses.
- Multilingual: Support for 10+ Indian languages plus English, with high-accuracy transcription and translation capabilities.
- Smart Voice Activity Detection (VAD): Customizable sensitivity for optimal speech boundary detection.
Common Use Cases
- Live Transcription: Real-time captions for meetings, webinars, and broadcasts
- Voice Assistants: Interactive voice applications with immediate responses
- Call Centers: Live call transcription and analysis
- Accessibility: Real-time captioning for hearing-impaired users
Audio Format Support: Streaming APIs only support two audio formats:
- WAV (`wav`)
- Raw PCM (`pcm_s16le`, `pcm_l16`, `pcm_raw`)
Other formats, such as MP3, AAC, and OGG, are not supported for WebSocket streaming. Find sample audio files in our GitHub cookbook.
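Before streaming, it can help to confirm that a file really is 16-bit PCM WAV at the rate you plan to declare. A small check using Python's standard-library `wave` module:

```python
import wave

# Inspect a WAV file's format before streaming it.
with wave.open("sample.wav", "rb") as wav:
    print("sample rate: ", wav.getframerate())  # e.g. 16000 or 8000
    print("channels:    ", wav.getnchannels())  # mono is typical for speech
    print("sample width:", wav.getsampwidth(), "bytes")  # 2 => 16-bit PCM (pcm_s16le)
```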
Getting Started
Get up and running with streaming in minutes. Choose between Speech-to-Text (STT) for transcription or Speech-to-Text Translation (STTT) for direct translation.
Basic Usage
The simplest way to get started with real-time processing:
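A minimal Python sketch of the STT flow using the `websockets` library. The endpoint URL, the `api-key` query parameter, and the JSON field names are placeholders rather than the confirmed schema; take the exact values from the Speech-to-Text WebSocket reference. STTT follows the same flow against the translate endpoint with `saaras:v2.5`.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

# Placeholders: substitute the real endpoint and auth scheme from the endpoint reference.
STT_URL = "wss://<api-host>/speech-to-text/ws?model=saarika:v2.5&api-key=<YOUR_API_KEY>"
CHUNK_BYTES = 8000  # ~0.25 s of 16 kHz, 16-bit mono PCM

async def transcribe(path: str) -> None:
    async with websockets.connect(STT_URL) as ws:
        # Stream the file in small chunks, roughly pacing real-time capture.
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_BYTES):
                # Assumed envelope: base64-encoded audio inside a JSON message.
                await ws.send(json.dumps(
                    {"audio": {"data": base64.b64encode(chunk).decode("ascii")}}
                ))
                await asyncio.sleep(0.25)
        # Print results as they arrive; runs until the server closes the connection.
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "transcript":
                print(msg)

asyncio.run(transcribe("sample.wav"))
```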
Enhanced Processing with Voice Detection
Add smart voice activity detection for better accuracy and control:
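A sketch that enables VAD on the connection and reacts to `speech_start`/`speech_end` events alongside transcripts. Passing `vad_signals` and `high_vad_sensitivity` as query parameters is an assumption about where these flags live; check the endpoint reference. Sending and receiving run concurrently so VAD events arrive while audio is still streaming.

```python
import asyncio
import base64
import json

import websockets

# Placeholder endpoint; VAD flags as query parameters are an assumption.
URL = ("wss://<api-host>/speech-to-text/ws"
       "?model=saarika:v2.5&vad_signals=true&high_vad_sensitivity=false&api-key=<KEY>")

async def send_audio(ws, path: str) -> None:
    with open(path, "rb") as f:
        while chunk := f.read(8000):
            await ws.send(json.dumps({"audio": {"data": base64.b64encode(chunk).decode()}}))
            await asyncio.sleep(0.25)

async def receive(ws) -> None:
    # Runs until the server closes the connection.
    async for raw in ws:
        msg = json.loads(raw)
        if msg.get("type") == "speech_start":
            print("-- speech started --")
        elif msg.get("type") == "speech_end":
            print("-- speech ended --")
        elif msg.get("type") == "transcript":
            print(msg)

async def main() -> None:
    async with websockets.connect(URL) as ws:
        # Sender and receiver run concurrently.
        await asyncio.gather(send_audio(ws, "sample.wav"), receive(ws))

asyncio.run(main())
```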
Instant Processing with Flush Signals
Force immediate processing without waiting for silence detection:
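A sketch of sending a flush signal after the last audio chunk to force immediate processing. The control message shape (`{"type": "flush"}`) is hypothetical; use whatever flush message the endpoint reference specifies.

```python
import asyncio
import base64
import json

import websockets

URL = "wss://<api-host>/speech-to-text/ws?model=saarika:v2.5&api-key=<KEY>"  # placeholder

async def transcribe_with_flush(path: str) -> None:
    async with websockets.connect(URL) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(8000):
                await ws.send(json.dumps({"audio": {"data": base64.b64encode(chunk).decode()}}))
        # Hypothetical control message: ask the server to process buffered
        # audio now instead of waiting for silence detection.
        await ws.send(json.dumps({"type": "flush"}))
        print(json.loads(await ws.recv()))

asyncio.run(transcribe_with_flush("sample.wav"))
```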
Custom Audio Configuration
Optimize for your specific audio setup:
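A sketch for 8kHz raw PCM input. As the warning below explains, `sample_rate` is set both on the connection and in every audio message; the `encoding` field name here is an assumption.

```python
import asyncio
import base64
import json

import websockets

# Placeholder endpoint; sample_rate is set once on the connection...
URL = "wss://<api-host>/speech-to-text/ws?model=saarika:v2.5&sample_rate=8000&api-key=<KEY>"

async def transcribe_8khz(path: str) -> None:
    async with websockets.connect(URL) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(4000):  # ~0.25 s of 8 kHz, 16-bit mono PCM
                await ws.send(json.dumps({
                    "audio": {
                        "data": base64.b64encode(chunk).decode(),
                        "sample_rate": 8000,      # ...and again on every audio message
                        "encoding": "pcm_s16le",  # assumed field name for the format
                    }
                }))
        async for raw in ws:
            print(json.loads(raw))

asyncio.run(transcribe_8khz("audio_8khz.raw"))
```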
Important: Sample Rate Configuration for 8kHz Audio
When working with 8kHz audio, you must set the `sample_rate` parameter in both places:
- When connecting to the WebSocket (connection parameter)
- When sending audio data (transcribe/translate parameter)
Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.
Example for STT:
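A sketch using the same placeholder endpoint and assumed message shape as the examples above; note that `sample_rate=8000` appears in both places:

```python
# 1) On the connection (placeholder STT endpoint):
URL = "wss://<api-host>/speech-to-text/ws?model=saarika:v2.5&sample_rate=8000&api-key=<KEY>"

# 2) On every audio message:
message = {"audio": {"data": "<base64 audio>", "sample_rate": 8000}}
```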
Example for STTT:
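The same two settings against the translate endpoint (URL and field names are again placeholders):

```python
# 1) On the connection (placeholder STTT endpoint):
URL = "wss://<api-host>/speech-to-text-translate/ws?model=saaras:v2.5&sample_rate=8000&api-key=<KEY>"

# 2) On every audio message:
message = {"audio": {"data": "<base64 audio>", "sample_rate": 8000}}
```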
For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket
API Reference
Connection Parameters
Configure your WebSocket connection with these parameters. See the endpoint reference for the full list; the ones used in this guide are:
- `sample_rate`: sample rate of the incoming audio; must match the value sent with each audio message
- `vad_signals`: set to `true` to receive `speech_start`/`speech_end` events alongside results
- `high_vad_sensitivity`: enables more sensitive voice activity detection for tighter speech boundaries
Audio Data Parameters
When sending audio data to the streaming endpoint, each message carries the audio payload along with:
- `sample_rate`: must match the value set on the connection; mismatched rates cause poor transcription quality or errors
See the endpoint reference for the full message schema.
Response Types
When `vad_signals=true`, you'll receive different message types:
For STT:
- `speech_start`: Voice activity detected
- `speech_end`: Voice activity stopped
- `transcript`: Final transcription result
For STTT:
- `speech_start`: Voice activity detected
- `speech_end`: Voice activity stopped
- `translation`: Final translation result
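A small handler that dispatches on the message `type`, covering both STT and STTT streams. Payload field names beyond `type` are assumptions; see the endpoint reference for the exact result schema.

```python
import json

def handle_message(raw: str) -> None:
    """Dispatch a streaming message by its assumed `type` field."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "speech_start":
        print("speech started")
    elif kind == "speech_end":
        print("speech ended")
    elif kind in ("transcript", "translation"):
        # Final STT or STTT result.
        print(f"{kind}: {msg}")
    else:
        print("unhandled message:", msg)
```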
Key Differences: STT vs STTT
- Model: STT uses Saarika (`saarika:v2.5`); STTT uses Saaras (`saaras:v2.5`)
- Result message: STT emits `transcript`; STTT emits `translation`
- Output language: STT returns text in the original spoken language; STTT returns English
Best Practices
- Audio Quality & Sample Rate:
  - Use a 16kHz sample rate for best results
  - For 8kHz audio, always set `sample_rate=8000` in both the connection and the transcribe/translate calls
  - Ensure both sample rate parameters match your actual audio sample rate
- Silence Handling:
  - Use 1 second of silence when `high_vad_sensitivity=false`
  - Use 0.5 seconds of silence when `high_vad_sensitivity=true`
- Continuous Streaming: Send audio data continuously for real-time results
- Error Handling: Always implement proper WebSocket error handling
- Model Selection:
  - Use Saarika (`saarika:v2.5`) for transcription in the original language
  - Use Saaras (`saaras:v2.5`) for direct translation to English