Streaming Speech-to-Text API
Real-time Processing
Process audio streams in real-time with WebSocket connections. Ideal for:
- Live transcription
- Real-time translation
- Interactive applications
- Low-latency requirements
Features
- Live audio transcription
- WebSocket-based streaming
- Low latency responses
- Multiple Indian languages and English support
- Language code specification (e.g., “kn-IN” for Kannada)
- High accuracy transcription
- Python and JavaScript SDKs with async support
- WebSocket connections
- Easy-to-use API interface
- The STT and STTT WebSockets only support .wav and raw PCM audio
Best Practices
- Send a continuous stream of audio data
- Use 1 second of silence when VAD (Voice Activity Detection) sensitivity is FALSE
- Use 0.5 seconds of silence when VAD (Voice Activity Detection) sensitivity is TRUE (see the silence-padding sketch after this list)
- You can send audio of arbitrary length
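The trailing silence matters because the server's VAD uses it to decide that an utterance has ended. As a rough illustration (not part of the official SDK), here is one way to append silence to raw 16-bit mono PCM before streaming it; the 16 kHz rate is only an assumption and should match the sample_rate you actually send:

```python
# Minimal sketch: generate trailing silence for 16 kHz, 16-bit mono PCM.
SAMPLE_RATE = 16000      # Hz; must match the sample_rate you stream with
BYTES_PER_SAMPLE = 2     # 16-bit PCM

def silence(seconds: float) -> bytes:
    """Return `seconds` of digital silence as raw 16-bit mono PCM."""
    return b"\x00" * int(SAMPLE_RATE * seconds * BYTES_PER_SAMPLE)

# Raw PCM recorded at SAMPLE_RATE (path is just a placeholder).
with open("speech.raw", "rb") as f:
    speech_chunk = f.read()

# high_vad_sensitivity=False -> append 1 second of silence;
# high_vad_sensitivity=True  -> 0.5 seconds is enough.
audio_to_send = speech_chunk + silence(1.0)
```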
You can find sample audio files and their corresponding base64-encoded strings in the GitHub cookbook.
Saarika: Our Speech to Text Transcription Model
Basic Streaming Transcription
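Complete, up-to-date Python and JavaScript examples are available in the GitHub cookbook. As a rough orientation only, the sketch below uses the `websockets` package directly; the endpoint URL, the `api-subscription-key` header name, and the JSON message fields are assumptions and may differ from the actual wire format, so treat the cookbook as the source of truth.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

API_KEY = "YOUR_SARVAM_API_KEY"

# Assumed endpoint and query-parameter spelling; confirm against the cookbook.
URL = (
    "wss://api.sarvam.ai/speech-to-text/ws"
    "?language_code=kn-IN&model=saarika:v2.5&sample_rate=16000"
)

async def transcribe(path: str) -> None:
    async with websockets.connect(
        URL,
        additional_headers={"api-subscription-key": API_KEY},  # assumed header name
        # On websockets < 14, pass extra_headers= instead of additional_headers=.
    ) as ws:
        # Read a .wav (or raw PCM) file -- the only formats the WebSocket accepts --
        # ideally with trailing silence, per the best practices above.
        with open(path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")

        # Assumed message shape; the cookbook shows the authoritative schema.
        await ws.send(json.dumps({
            "audio": {
                "data": audio_b64,
                "encoding": "audio/wav",
                "sample_rate": 16000,
            }
        }))

        # Print whatever the server sends back; the final message contains the
        # transcript (field names may differ -- inspect the payload).
        response = await ws.recv()
        print(json.loads(response))

asyncio.run(transcribe("sample.wav"))
```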
Streaming Guide
Query Parameters
Send Parameters
- `language_code`: Specifies the language for speech recognition (e.g., `en-IN`, `hi-IN`, `kn-IN`, etc.).
- `model`: Selects the speech-to-text model version (e.g., `saarika:v2.5`).
- `high_vad_sensitivity`: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.
- `vad_signals`: When enabled, provides VAD event signals in the response stream:
  - "speech_start": Indicates the beginning of speech detection
  - "speech_end": Indicates the end of speech detection
  - "transcript": Contains the final transcription after speech end
- `sample_rate`: Specifies the sample rate of the input audio in Hz (e.g., `8000`, `16000`, `44100`). Allows for optimal processing based on your audio quality.
New Streaming Features
Sample Rate and Input Audio Codec Support
STT streaming now supports specifying both the sample rate and the input audio codec, so processing can be optimized for your audio.
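A sketch of how these settings might be passed as query parameters when opening the connection. The docs name the feature but not the exact keys, so `input_audio_codec` and its value below are assumptions; verify the real parameter names in the GitHub cookbook.

```python
from urllib.parse import urlencode

# Assumed parameter names and codec identifier -- confirm before relying on them.
params = {
    "language_code": "hi-IN",
    "model": "saarika:v2.5",
    "sample_rate": 8000,               # e.g., telephony-quality audio
    "input_audio_codec": "pcm_s16le",  # assumed codec key and value
}
url = "wss://api.sarvam.ai/speech-to-text/ws?" + urlencode(params)
# ...then connect with websockets.connect(url, ...) exactly as in the
# basic streaming sketch above.
```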
Example Usage
Basic Streaming
Streaming with VAD Signals
When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:
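The sketch below assumes each server message is JSON with a field identifying the signal type; those field names are assumptions, so adapt them to the payloads you actually receive (see the cookbook for the real schema).

```python
import asyncio
import json

async def wait_for_transcript(ws, timeout: float = 30.0) -> str | None:
    """Consume VAD signal messages until the final transcript arrives.

    Assumes messages look like {"type": "speech_start" | "speech_end" |
    "transcript", ...}; adjust field names to the real payloads.
    """
    try:
        while True:
            message = json.loads(await asyncio.wait_for(ws.recv(), timeout))
            kind = message.get("type")
            if kind == "speech_start":
                print("speech detected...")
            elif kind == "speech_end":
                print("speech ended, waiting for transcript...")
            elif kind == "transcript":
                return message.get("transcript")
    except asyncio.TimeoutError:
        print("timed out waiting for the transcript")
        return None
```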
Note: When using `vad_signals=True`, expect a slight delay between receiving the "speech_end" signal and the final transcript. This delay allows the model to process the complete audio segment and generate accurate transcription. The timeout in the example above can be adjusted based on your audio length and requirements.
Saaras: Our Speech to Text Translation Model
Streaming Translation
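The GitHub cookbook has the full Python and JavaScript translation examples. The sketch below only highlights what changes relative to the Saarika sketch above; the translation endpoint path, header name, and message fields are assumptions.

```python
import asyncio
import base64
import json

import websockets

API_KEY = "YOUR_SARVAM_API_KEY"

# Assumed endpoint path for the translation WebSocket; note there is no
# language_code here -- per the parameters below, only the model is selected.
URL = (
    "wss://api.sarvam.ai/speech-to-text-translate/ws"
    "?model=saaras:v2.5&sample_rate=16000"
)

async def translate(path: str) -> None:
    async with websockets.connect(
        URL,
        additional_headers={"api-subscription-key": API_KEY},  # assumed header name
    ) as ws:
        with open(path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")
        # Same assumed message shape as the Saarika sketch above.
        await ws.send(json.dumps({
            "audio": {"data": audio_b64, "encoding": "audio/wav", "sample_rate": 16000}
        }))
        response = json.loads(await ws.recv())
        print(response)  # contains both the transcript and its translation

asyncio.run(translate("sample.wav"))
```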
STT Translation Streaming Guide
Query Parameters
Send Parameters
- `model`: Selects the speech-to-text model version (e.g., `saaras:v2.5`).
- `high_vad_sensitivity`: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.
- `vad_signals`: When enabled, provides VAD event signals in the response stream:
  - "speech_start": Indicates the beginning of speech detection
  - "speech_end": Indicates the end of speech detection
  - "transcript": Contains the transcription and translation after speech end
- `sample_rate`: Specifies the sample rate of the input audio in Hz (e.g., `8000`, `16000`, `44100`). Allows for optimal processing based on your audio quality.
Example Usage
Basic Streaming
Streaming with VAD Signals
When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:
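The message-handling loop is the same as in the Saarika VAD sketch above; the only difference is that the final "transcript" message also carries the translation. The field names below are assumptions, so inspect the real payloads.

```python
import asyncio
import json

async def wait_for_translation(ws, timeout: float = 30.0) -> dict | None:
    """Like wait_for_transcript above, but the final message also carries
    the translation (field names are assumptions)."""
    try:
        while True:
            message = json.loads(await asyncio.wait_for(ws.recv(), timeout))
            kind = message.get("type")
            if kind == "speech_start":
                print("speech detected...")
            elif kind == "speech_end":
                print("speech ended, translating...")
            elif kind == "transcript":
                return {
                    "transcript": message.get("transcript"),
                    "translation": message.get("translation"),
                }
    except asyncio.TimeoutError:
        print("timed out waiting for the translated transcript")
        return None
```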
When using `vad_signals=True`, expect a slight delay between receiving the "speech_end" signal and the final transcript with translation. This delay allows the model to process the complete audio segment, generate transcription, and perform translation. The timeout in the example above can be adjusted based on your audio length and requirements.
Flush Signal
When streaming audio, you may want to flush the buffer before sending a new chunk.
The `flush_signal` lets you:
- Immediately process everything in the current buffer and get a transcript without waiting.
- Reduce latency so that interactive experiences (like live captions or assistants) feel more natural.
- Take control of when audio is finalized, instead of relying only on silence detection or timeouts.
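The exact flush message is not shown here, so the payload in the sketch below is only a placeholder assumption; check the cookbook or API reference for the real schema. Conceptually, you send the flush signal on the same WebSocket and then read the transcript for whatever audio was already buffered:

```python
import json

async def flush_and_read(ws) -> dict:
    """Force the server to finalize the current audio buffer.

    The {"event": "flush"} payload is an assumption standing in for the real
    flush_signal message -- replace it with the documented schema.
    """
    await ws.send(json.dumps({"event": "flush"}))
    # The next message should contain the transcript for the buffered audio.
    return json.loads(await ws.recv())
```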