Streaming Speech-to-Text API
Overview
Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that require immediate speech processing with minimal delay.
Key Benefits
- Low Latency: Get transcription results in milliseconds, not seconds, processing speech as it happens with near-instantaneous responses.
- Language Coverage: Support for 10+ Indian languages plus English, with high-accuracy transcription and translation.
- Smart Voice Activity Detection: Customizable VAD sensitivity for optimal speech boundary detection.
Common Use Cases
- Live Transcription: Real-time captions for meetings, webinars, and broadcasts
- Voice Assistants: Interactive voice applications with immediate responses
- Call Centers: Live call transcription and analysis
- Accessibility: Real-time captioning for hearing-impaired users
Audio Format Support: Streaming APIs support .wav and raw PCM formats only. Find sample audio files in our GitHub cookbook.
Getting Started
Get up and running with streaming in minutes. Choose between Speech-to-Text (STT) for transcription or Speech-to-Text Translation (STTT) for direct translation.
Basic Usage
The simplest way to get started with real-time processing:
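The sketch below shows the STT flow in Python with the websockets library. The endpoint URL, the api-subscription-key header, and the JSON message shapes are illustrative assumptions; the exact names are in the WebSocket references linked at the end of this section.

```python
# Minimal STT sketch. The endpoint URL, header name, and JSON schema are
# assumptions for illustration -- see the WebSocket reference for exact names.
import asyncio
import base64
import json

import websockets  # pip install websockets (13+ for additional_headers)

STT_URL = "wss://api.sarvam.ai/speech-to-text/ws?model=saarika:v2.5"  # hypothetical URL

async def transcribe(path: str) -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(STT_URL, additional_headers=headers) as ws:
        with open(path, "rb") as f:
            audio = f.read()
        # Send the audio as one base64-encoded JSON message (assumed shape).
        # Real applications stream smaller chunks continuously.
        await ws.send(json.dumps({
            "audio": {
                "data": base64.b64encode(audio).decode("ascii"),
                "sample_rate": 16000,
            }
        }))
        # Read server messages until the final transcript arrives.
        async for raw in ws:
            message = json.loads(raw)
            if message.get("type") == "transcript":
                print(message)
                break

asyncio.run(transcribe("sample.wav"))
```

For STTT, connect to the Speech-to-Text Translate endpoint with model=saaras:v2.5 instead; the flow is identical, but the final message type is translation rather than transcript.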
Enhanced Processing with Voice Detection
Add smart voice activity detection for better accuracy and control:
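Here is a sketch of enabling VAD on an STT connection. The vad_signals and high_vad_sensitivity parameter names come from this guide; passing them as query parameters, along with the URL and header, is an assumption.

```python
# VAD sketch: request voice-activity events at connection time.
# vad_signals / high_vad_sensitivity come from this guide; the URL format
# and header name are assumptions.
import asyncio
import json

import websockets

URL = (
    "wss://api.sarvam.ai/speech-to-text/ws"  # hypothetical URL
    "?model=saarika:v2.5&vad_signals=true&high_vad_sensitivity=true"
)

async def listen() -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # ... stream audio chunks here, as in the basic example ...
        async for raw in ws:
            message = json.loads(raw)
            kind = message.get("type")
            if kind == "speech_start":
                print("speech started")
            elif kind == "speech_end":
                print("speech ended; result follows")
            elif kind == "transcript":
                print(message)

asyncio.run(listen())
```

The same flags apply to STTT connections; the final message type is translation instead of transcript.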
Instant Processing with Flush Signals
Force immediate processing without waiting for silence detection:
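Below is a sketch of sending a flush signal after the final audio chunk. The {"event": "flush"} message shape is a placeholder guess; consult the endpoint reference for the actual flush signal format.

```python
# Flush sketch: process buffered audio immediately instead of waiting for
# silence detection. The flush message shape is an assumption.
import asyncio
import base64
import json

import websockets

URL = "wss://api.sarvam.ai/speech-to-text/ws?model=saarika:v2.5"  # hypothetical URL

async def transcribe_now(chunk: bytes) -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "audio": {
                "data": base64.b64encode(chunk).decode("ascii"),
                "sample_rate": 16000,
            }
        }))
        # Hypothetical flush signal: ask for an immediate result.
        await ws.send(json.dumps({"event": "flush"}))
        print(json.loads(await ws.recv()))

with open("sample.wav", "rb") as f:
    asyncio.run(transcribe_now(f.read()))
```

On an STTT connection the flow is the same, with the result arriving as a translation message.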
Custom Audio Configuration
Optimize for your specific audio setup:
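The sketch below streams 8kHz raw PCM in 0.2-second chunks with an explicit sample rate. The query parameters and message fields are the same assumptions used above.

```python
# Custom audio configuration sketch: 8kHz raw PCM streamed in 0.2 s chunks.
# Query-parameter and field names are assumptions for illustration.
import asyncio
import base64
import json

import websockets

URL = "wss://api.sarvam.ai/speech-to-text/ws?model=saarika:v2.5&sample_rate=8000"  # hypothetical URL

CHUNK_BYTES = 3200  # 0.2 s of 8kHz 16-bit mono PCM

async def stream(path: str) -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(URL, additional_headers=headers) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_BYTES):
                await ws.send(json.dumps({
                    "audio": {
                        "data": base64.b64encode(chunk).decode("ascii"),
                        "sample_rate": 8000,  # must match the connection value
                    }
                }))
        async for raw in ws:
            print(json.loads(raw))

asyncio.run(stream("call.pcm"))
```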
Important: Sample Rate Configuration for 8kHz Audio
When working with 8kHz audio, you must set the sample_rate parameter in both places:
- When connecting to the WebSocket (connection parameter)
- When sending audio data (transcribe/translate parameter)
Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.
Example for STT:
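A minimal sketch (same assumed URL format and message schema as above) with the two required sample_rate settings marked:

```python
# STT with 8kHz audio: sample_rate is set in BOTH places.
import asyncio
import base64
import json

import websockets

# (1) connection parameter (hypothetical URL)
URL = "wss://api.sarvam.ai/speech-to-text/ws?model=saarika:v2.5&sample_rate=8000"

async def main() -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(URL, additional_headers=headers) as ws:
        with open("call_8k.wav", "rb") as f:
            chunk = f.read()
        await ws.send(json.dumps({"audio": {
            "data": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": 8000,  # (2) transcribe parameter -- must match (1)
        }}))
        print(json.loads(await ws.recv()))

asyncio.run(main())
```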
Example for STTT:
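The same sketch against the translate endpoint with saaras:v2.5 (endpoint URL assumed):

```python
# STTT with 8kHz audio: same two settings, saaras:v2.5 on the translate endpoint.
import asyncio
import base64
import json

import websockets

# (1) connection parameter (hypothetical URL)
URL = "wss://api.sarvam.ai/speech-to-text-translate/ws?model=saaras:v2.5&sample_rate=8000"

async def main() -> None:
    headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
    async with websockets.connect(URL, additional_headers=headers) as ws:
        with open("call_8k.wav", "rb") as f:
            chunk = f.read()
        await ws.send(json.dumps({"audio": {
            "data": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": 8000,  # (2) translate parameter -- must match (1)
        }}))
        print(json.loads(await ws.recv()))

asyncio.run(main())
```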
For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket
API Reference
Connection Parameters
Configure your WebSocket connection with these parameters:
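The authoritative parameter table is in the endpoint reference. The connection parameters used in this guide are model, sample_rate, vad_signals, and high_vad_sensitivity; the sketch below combines them into a connection URL, assuming they are passed as query parameters.

```python
# Building a connection URL from the parameters used in this guide.
# The base URL and the query-string transport are assumptions.
from urllib.parse import urlencode

BASE = "wss://api.sarvam.ai/speech-to-text/ws"  # hypothetical endpoint

params = {
    "model": "saarika:v2.5",         # saaras:v2.5 on the translate endpoint
    "sample_rate": 16000,            # must match your audio (8000 for 8kHz)
    "vad_signals": "true",           # emit speech_start / speech_end events
    "high_vad_sensitivity": "true",  # tighter speech-boundary detection
}

print(f"{BASE}?{urlencode(params)}")
```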
Audio Data Parameters
When sending audio data to the streaming endpoint:
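A sketch of the audio message assumed throughout this guide: base64-encoded .wav or raw PCM bytes plus a sample_rate that matches the connection.

```python
# Assumed shape of one audio message; see the endpoint reference for the
# authoritative field names.
import base64
import json

def audio_message(chunk: bytes, sample_rate: int = 16000) -> str:
    """Wrap raw .wav or PCM bytes for the streaming endpoint."""
    return json.dumps({
        "audio": {
            "data": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": sample_rate,  # must match the connection parameter
        }
    })
```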
Response Types
When vad_signals=true, you’ll receive different message types:
For STT:
- speech_start: Voice activity detected
- speech_end: Voice activity stopped
- transcript: Final transcription result
For STTT:
- speech_start: Voice activity detected
- speech_end: Voice activity stopped
- translation: Final translation result
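A small dispatcher covering both endpoints. Only the type values above come from this guide; the payload layout is an assumption.

```python
# Dispatch VAD-annotated messages from either endpoint. Only the `type`
# values come from this guide; the payload layout is an assumption.
import json

def handle(raw: str) -> None:
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "speech_start":
        print("voice activity detected")
    elif kind == "speech_end":
        print("voice activity stopped")
    elif kind in ("transcript", "translation"):  # STT vs STTT final result
        print(message)
```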
Key Differences: STT vs STTT
- Model: STT uses Saarika (saarika:v2.5); STTT uses Saaras (saaras:v2.5)
- Output: STT returns a transcript in the original language; STTT returns an English translation
- Final message type: STT emits transcript; STTT emits translation
Best Practices
- Audio Quality & Sample Rate:
  - Use a 16kHz sample rate for best results
  - For 8kHz audio, always set sample_rate=8000 in both the connection and the transcribe/translate calls
  - Ensure both sample rate parameters match your actual audio sample rate
- Silence Handling:
  - Use 1 second of silence when high_vad_sensitivity=false
  - Use 0.5 seconds of silence when high_vad_sensitivity=true
- Continuous Streaming: Send audio data continuously for real-time results
- Error Handling: Always implement proper WebSocket error handling (see the sketch after this list)
- Model Selection:
  - Use Saarika (saarika:v2.5) for transcription in the original language
  - Use Saaras (saaras:v2.5) for direct translation to English
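As referenced in the list above, here is a sketch of defensive WebSocket handling: reconnect with exponential backoff when the stream drops. websockets.ConnectionClosed is the library's real exception type; the endpoint details remain the assumptions used earlier.

```python
# Reconnect-with-backoff sketch around the streaming loop.
# websockets.ConnectionClosed is the library's exception for dropped streams.
import asyncio

import websockets

URL = "wss://api.sarvam.ai/speech-to-text/ws?model=saarika:v2.5"  # hypothetical URL

async def run_forever() -> None:
    backoff = 1.0
    while True:
        try:
            headers = {"api-subscription-key": "YOUR_API_KEY"}  # assumed header name
            async with websockets.connect(URL, additional_headers=headers) as ws:
                backoff = 1.0  # healthy connection: reset the delay
                async for raw in ws:
                    print(raw)  # dispatch as shown under Response Types
        except (websockets.ConnectionClosed, OSError) as exc:
            # Transient failure: wait, then reconnect with exponential backoff.
            print(f"connection lost ({exc!r}); retrying in {backoff:.0f}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30.0)

asyncio.run(run_forever())
```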