Streaming Speech-to-Text API

Overview

Transform audio into text in real time with our WebSocket-based streaming API, built for applications that need immediate speech processing with minimal delay.

For complete API reference documentation, see the Speech-to-Text API Reference section.

Model Availability: The Streaming API supports Saaras v3 (recommended), which offers multiple output modes via the mode parameter. The legacy models Saarika v2.5 and Saaras v2.5 remain available, but we recommend switching to Saaras v3 for the best accuracy and features.

Supported Modes (Saaras v3)

Mode	Description	Output
`transcribe`	Standard transcription in the original language	Text in source language
`translate`	Transcribe and translate to English	English text
`verbatim`	Word-for-word transcription including filler words and repetitions	Verbatim text in source language
`translit`	Transcribe and transliterate to Roman script	Romanized text
`codemix`	Transcribe code-mixed speech (e.g., Hindi-English) naturally	Code-mixed text

Key Benefits

Ultra-Low Latency

Get transcription results in milliseconds, not seconds. Process speech as it happens with near-instantaneous responses.

Multi-Language Support

Support for 10+ Indian languages plus English, with high-accuracy transcription and translation.

Advanced Voice Detection

Smart Voice Activity Detection (VAD) with customizable sensitivity for optimal speech boundary detection.

Common Use Cases

Live Transcription: Real-time captions for meetings, webinars, and broadcasts
Voice Assistants: Interactive voice applications with immediate responses
Call Centers: Live call transcription and analysis
Accessibility: Real-time captioning for hearing-impaired users

Audio Format Support: Streaming APIs only support two audio formats:

WAV (wav)
Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

Other formats like MP3, AAC, OGG, etc. are not supported for WebSocket streaming. Find sample audio files in our GitHub cookbook.
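As a rough pre-flight check, a client could confirm that its input really is a WAV container before opening a socket, instead of trusting the file extension. The sketch below uses only plain Python; `looks_like_wav` is a hypothetical helper and not part of the SDK. It inspects the first 12 bytes for the RIFF/WAVE signature, so it says nothing about whether the audio inside is otherwise acceptable to the API.

```python
def looks_like_wav(header: bytes) -> bool:
    """Return True if the bytes begin with a RIFF container tagged as WAVE."""
    return (
        len(header) >= 12
        and header[:4] == b"RIFF"      # RIFF container signature
        and header[8:12] == b"WAVE"    # form type identifying WAV audio
    )


# Example inputs: a WAV-style header, an Ogg page header, and an ID3-tagged MP3.
wav_header = b"RIFF\x24\x00\x00\x00WAVEfmt "
ogg_header = b"OggS\x00\x02" + b"\x00" * 10
mp3_header = b"ID3\x03" + b"\x00" * 8

checks = [looks_like_wav(h) for h in (wav_header, ogg_header, mp3_header)]
```

In practice you would read the first 12 bytes of the file with `open(path, "rb").read(12)` and pass them to the helper. Raw PCM has no header at all, so this check only applies to WAV input.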

Getting Started

Get up and running with streaming in minutes. Simply change the mode parameter to switch between transcription, translation, and other output formats.

Choosing a Mode

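Every mode (transcribe, translate, verbatim, translit, codemix) uses the same connection call; only the mode value changes. The example below uses transcribe, which transcribes audio in the original language.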

async with client.speech_to_text_streaming.connect(
    model="saaras:v3",
    mode="transcribe",              # Standard transcription
    language_code="en-IN",
    high_vad_sensitivity=True
) as ws:
    await ws.transcribe(audio=audio_data)
    response = await ws.recv()
    print(f"Transcription: {response}")

Full Example

Here’s a complete working example. Change the mode parameter to switch between any of the supported modes:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def basic_transcription():
    # Initialize client with your API key
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Connect and transcribe; change mode as needed
    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",
        language_code="en-IN",
        high_vad_sensitivity=True
    ) as ws:
        await ws.transcribe(audio=audio_data)
        response = await ws.recv()
        print(f"Result: {response}")

asyncio.run(basic_transcription())

Enhanced Processing with Voice Detection

Add smart voice activity detection for better accuracy and control:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def enhanced_transcription():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",              # Change mode as needed
        language_code="hi-IN",
        high_vad_sensitivity=True,      # Better voice detection
        vad_signals=True                # Get speech start/end signals
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        async for message in ws:
            if message.type == "events":
                # VAD signals arrive as events (signal_type is START_SPEECH / END_SPEECH)
                print(f"Voice activity: {message.data.signal_type}")
            elif message.type == "data":
                print(f"Result: {message.data.transcript}")
                break

asyncio.run(enhanced_transcription())

Instant Processing with Flush Signals

Force immediate processing without waiting for silence detection:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def instant_processing():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",              # Change mode as needed
        language_code="en-IN",
        flush_signal=True               # Enable manual control
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Force immediate processing
        await ws.flush()

        async for message in ws:
            print(f"Result: {message}")
            break

asyncio.run(instant_processing())

Custom Audio Configuration

Optimize for your specific audio setup (e.g., 8kHz telephony audio):

import asyncio
import base64
from sarvamai import AsyncSarvamAI

with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def custom_audio_config():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",              # Change mode as needed
        language_code="kn-IN",
        sample_rate=8000,               # Match your audio
        input_audio_codec="pcm_s16le",  # Specify codec
        high_vad_sensitivity=True
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=8000            # Must match connection setting
        )

        response = await ws.recv()
        print(f"Result: {response}")

asyncio.run(custom_audio_config())

Important: Sample Rate Configuration for 8kHz Audio

When working with 8kHz audio, you must set the sample_rate parameter in both places:

When connecting to the WebSocket (connection parameter)
When sending audio data (transcribe parameter)

Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.

async with client.speech_to_text_streaming.connect(
    model="saaras:v3",
    mode="transcribe",
    language_code="en-IN",
    sample_rate=8000        # Must match your audio
) as ws:
    await ws.transcribe(
        audio=audio_data,
        sample_rate=8000    # Must match connection setting
    )
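One way to keep the two values in sync is to read the rate from the WAV header once and reuse it in both calls. The following is a minimal sketch using Python's standard `wave` module; `wav_sample_rate` and `make_test_wav` are hypothetical helpers written for this illustration, and the in-memory WAV stands in for your real file.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate (Hz) declared in a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return reader.getframerate()


def make_test_wav(sample_rate: int, n_samples: int = 800) -> bytes:
    """Build a short mono 16-bit silent WAV in memory (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)              # 2 bytes per sample = 16-bit
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * n_samples)
    return buffer.getvalue()


# Read the rate once, then reuse the same value in both places.
rate = wav_sample_rate(make_test_wav(8000))
connect_kwargs = {"sample_rate": rate}      # would be passed to connect(...)
transcribe_kwargs = {"sample_rate": rate}   # would be passed to ws.transcribe(...)
```

Deriving both arguments from a single variable removes the most common source of mismatch, which is editing one call and forgetting the other.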

For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket

Handling Disconnects

Long-lived sockets will occasionally drop (network blips, idle timeouts, server restarts). Inspect the WebSocket close code and reconnect with backoff.

Close code	Meaning	What to do
`1000`	Normal closure	You called `close()` — nothing to do
`1001`	Going away	Server or client shutting down — reconnect
`1006`	Abnormal closure (no close frame)	Network drop — reconnect with backoff
`1011`	Server error	Retry with backoff; if persistent, check status
`4xxx`	Application-specific	Read the close reason for details (e.g. auth or quota); fix before reconnecting

Codes 1000–1015 are standard WebSocket codes. Any 4000–4999 code is application-specific — always read the accompanying close reason string rather than assuming a fixed meaning.

Reconnect with exponential backoff (pseudocode — applies to both SDKs):

attempt = 0
while not connected and attempt < MAX_ATTEMPTS:
    try:
        open WebSocket and resume streaming
        attempt = 0                      # reset on success
    except (close 1006 / 1011 / network error):
        delay = min(BASE * 2 ** attempt, MAX_DELAY)   # e.g. 0.5s, 1s, 2s, 4s ... capped
        sleep(delay + small random jitter)
        attempt += 1
    on close 4xxx (auth/quota): stop and surface the error  # do not blind-retry

Do not auto-retry on 4xxx auth/quota closes — fix the underlying issue first (see Errors & Troubleshooting).
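The retry policy above can be sketched in Python as two small pure functions, kept separate from any socket code so they are easy to test. The names `should_retry` and `backoff_delay`, the 0.5 s base, the 30 s cap, and the 0.1 s jitter range are all assumptions chosen for illustration; tune them to your own limits.

```python
import random

# Standard close codes that the table above treats as worth reconnecting on.
RETRYABLE_CODES = {1001, 1006, 1011}


def should_retry(close_code: int) -> bool:
    """Decide whether to reconnect after a close with the given code."""
    if 4000 <= close_code <= 4999:
        return False                  # application-specific: fix the cause first
    return close_code in RETRYABLE_CODES


def backoff_delay(attempt: int, base: float = 0.5, max_delay: float = 30.0,
                  jitter: float = 0.1, rng=random.random) -> float:
    """Exponential backoff (base * 2**attempt), capped, plus a little random jitter."""
    delay = min(base * 2 ** attempt, max_delay)
    return delay + rng() * jitter


# Delays for the first five attempts, with jitter disabled for readability.
schedule = [backoff_delay(n, rng=lambda: 0.0) for n in range(5)]
```

A reconnect loop would call `should_retry` on the close code, sleep for `backoff_delay(attempt)` with `asyncio.sleep`, and reset `attempt` to zero after a successful connection. Passing `rng` as a parameter keeps the function deterministic in tests.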

Voice-Agent Barge-In

In a voice agent, the user may start speaking while your TTS reply is still playing (“barge-in”). Use vad_signals=true and treat the START_SPEECH event as the cue to stop playback immediately and let the user take the turn.

# mic_chunk, tts_player, and handle_user_turn are placeholders for your own
# audio capture, playback, and turn-handling code.
async with client.speech_to_text_streaming.connect(
    model="saaras:v3",
    mode="transcribe",
    language_code="hi-IN",
    high_vad_sensitivity=True,   # 0.5s silence boundary, snappier for conversation
    vad_signals=True,            # emit START_SPEECH / END_SPEECH events
) as ws:
    await ws.transcribe(audio=mic_chunk, encoding="audio/wav", sample_rate=16000)

    async for message in ws:
        if message.type == "events" and message.data.signal_type == "START_SPEECH":
            tts_player.stop()          # barge-in: cut off the agent's current reply
        elif message.type == "data":
            handle_user_turn(message.data.transcript)

For conversational use, prefer high_vad_sensitivity=True (0.5s silence boundary) so the agent reacts quickly. See the LiveKit and Pipecat voice-agent integration guides for full agent setups, and Credits & Rate Limits for concurrency limits on streaming connections.
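If the barge-in logic grows beyond a few lines, it may help to pull it out of the receive loop into a small object that can be tested without a live socket. The sketch below assumes messages have the shape used in the examples above (`message.type`, `message.data.signal_type`, `message.data.transcript`); `BargeInController` and `fake_message` are hypothetical names, not SDK classes.

```python
from types import SimpleNamespace


class BargeInController:
    """Routes streaming messages: stop playback on speech start, forward transcripts."""

    def __init__(self, stop_playback, on_transcript):
        self._stop_playback = stop_playback
        self._on_transcript = on_transcript
        self.user_speaking = False

    def handle(self, message) -> None:
        if message.type == "events":
            signal = message.data.signal_type
            if signal == "START_SPEECH":
                self.user_speaking = True
                self._stop_playback()        # cut the agent's reply short
            elif signal == "END_SPEECH":
                self.user_speaking = False
        elif message.type == "data":
            self._on_transcript(message.data.transcript)


def fake_message(msg_type, **data):
    """Stand-in for an SDK message, used only to exercise the controller."""
    return SimpleNamespace(type=msg_type, data=SimpleNamespace(**data))


stops, turns = [], []
controller = BargeInController(lambda: stops.append("stopped"), turns.append)
controller.handle(fake_message("events", signal_type="START_SPEECH"))
controller.handle(fake_message("events", signal_type="END_SPEECH"))
controller.handle(fake_message("data", transcript="namaste"))
```

Inside the real loop you would call `controller.handle(message)` for each message from `async for message in ws`, passing your player's stop function and your turn handler to the constructor.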

API Reference

Connection Parameters

Configure your WebSocket connection with these parameters:

Parameter	Type	Description	Example
`language_code`	string	Language for speech recognition (STT only)	`"en-IN"`, `"hi-IN"`, `"kn-IN"`
`model`	string	Model version to use	`"saaras:v3"` (recommended), `"saarika:v2.5"` (legacy), `"saaras:v2.5"` (legacy)
`mode`	string	Output mode (saaras:v3 only): transcribe, translate, verbatim, translit, codemix	`"transcribe"`
`sample_rate`	integer	Audio sample rate in Hz	`8000`, `16000`
`input_audio_codec`	string	Audio codec format. Only `wav` and raw PCM formats (`pcm_s16le`, `pcm_l16`, `pcm_raw`) are supported	`"wav"`, `"pcm_s16le"`
`high_vad_sensitivity`	boolean	Enhanced voice activity detection	`true`, `false`
`vad_signals`	boolean	Receive speech start/end events	`true`, `false`
`flush_signal`	boolean	Enable manual buffer flushing	`true`, `false`

Fine VAD Tuning Parameters

For finer control than the high_vad_sensitivity preset, both the STT and STT-Translate WebSockets accept these optional parameters. Any value you pass overrides the server default (and the high_vad_sensitivity preset, where they overlap):

Parameter	Type	Description	Default
`positive_speech_threshold`	float	VAD probability (0.0–1.0) above which a frame counts as speech	`0.7`
`negative_speech_threshold`	float	VAD probability (0.0–1.0) below which a frame counts as silence	`0.45`
`min_speech_frames`	integer	Consecutive speech frames required to start a speech segment	`2`
`first_turn_min_speech_frames`	integer	Speech frames required specifically for the first user turn	`8`
`negative_frames_count`	integer	Silence frames within the window needed to end a speech segment	`18`
`negative_frames_window`	integer	Sliding window size (in frames) over which silence frames are counted	`24`
`start_speech_volume_threshold`	float	Volume (dB) below which audio is treated as too quiet to be speech	None (no volume filtering)
`interrupt_min_speech_frames`	integer	Speech frames required to register a barge-in / interruption	`2`
`pre_speech_pad_frames`	integer	Audio frames prepended before the detected speech onset so the start isn’t clipped	`9`
`num_initial_ignored_frames`	integer	Leading audio frames skipped at connection start (e.g. setup noise)	`0`

One frame is 512 audio samples — 32 ms at 16 kHz, 64 ms at 8 kHz. All fine VAD parameters are optional; if you only need quicker end-of-speech detection, start with high_vad_sensitivity=true before reaching for these.
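Because the fine VAD parameters are expressed in frames, it can be easier to reason about them after converting to milliseconds. The helper below is a small worked example of that arithmetic; `frame_ms` and `frames_to_ms` are hypothetical names, and the only facts used are the 512-sample frame size and the defaults from the table above.

```python
FRAME_SAMPLES = 512  # samples per VAD frame, as stated above


def frame_ms(sample_rate_hz: int) -> float:
    """Duration of one VAD frame in milliseconds."""
    return FRAME_SAMPLES * 1000 / sample_rate_hz


def frames_to_ms(frames: int, sample_rate_hz: int) -> float:
    """Convert a frame count (e.g. negative_frames_window) to milliseconds."""
    return frames * frame_ms(sample_rate_hz)


# Default values from the table, converted at 16 kHz.
defaults_16k = {
    "negative_frames_count": frames_to_ms(18, 16000),   # 576.0 ms
    "negative_frames_window": frames_to_ms(24, 16000),  # 768.0 ms
    "pre_speech_pad_frames": frames_to_ms(9, 16000),    # 288.0 ms
}
```

The same frame counts cover twice as much time at 8 kHz (for example, a 24-frame window spans 1536 ms), so values tuned for 16 kHz audio may need halving for telephony audio.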

Audio Data Parameters

When sending audio data to the streaming endpoint:

Parameter	Type	Description	Required
`audio`	string	Base64-encoded audio data	✅
`encoding`	string	Audio format	✅
`sample_rate`	integer	Audio sample rate (16000 Hz recommended). Must match the connection parameter	✅
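For live audio you will usually send many small payloads rather than one whole file. The sketch below shows one way to cut raw 16-bit mono PCM into fixed-duration pieces and Base64-encode each for the audio field. The helper name `pcm_chunks_b64` and the 100 ms chunk size are assumptions for illustration, not values required by the API.

```python
import base64
from typing import Iterator


def pcm_chunks_b64(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
                   bytes_per_sample: int = 2) -> Iterator[str]:
    """Yield Base64 strings, each holding up to chunk_ms of mono PCM audio."""
    chunk_bytes = (sample_rate * chunk_ms // 1000) * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[start:start + chunk_bytes]).decode("ascii")


# 250 ms of silence at 16 kHz, 16-bit mono = 8000 bytes -> chunks of 100, 100, 50 ms.
silence = bytes(8000)
chunks = list(pcm_chunks_b64(silence))
```

Each string could then be passed as `audio=` in successive `ws.transcribe(...)` calls with the matching `sample_rate`. Whether per-chunk calls suit your setup depends on the codec and encoding you configured, so treat this as a starting point.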

Response Types

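When vad_signals=true, you'll receive two message types: events messages, which carry VAD signals in signal_type, and data messages, which carry the final result.

For STT:

Speech start (events, signal_type START_SPEECH): Voice activity detected
Speech end (events, signal_type END_SPEECH): Voice activity stopped
Transcript (data): Final transcription result

For STTT:

Speech start (events, signal_type START_SPEECH): Voice activity detected
Speech end (events, signal_type END_SPEECH): Voice activity stopped
Translation (data): Final translation result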

Key Differences: STT vs STTT

Aspect	STT	STTT
Model	`saaras:v3` (recommended), `saarika:v2.5` (legacy)	`saaras:v3` (recommended), `saaras:v2.5` (legacy)
Method	`transcribe()`	`translate()`
Mode	`transcribe`, `verbatim`, `translit`, `codemix` (saaras:v3 only)	`translate` (saaras:v3 only)
Language Code	Required	Not required (auto-detected)
Output Language	Same as input	English only
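The table above can be encoded as a small lookup so that application code picks the method and required arguments from the mode alone. This is a sketch under the assumptions in the table (STT modes need a language code, translate does not); `plan_request` is a hypothetical helper and it only builds arguments, it does not call the SDK.

```python
from typing import Optional

STT_MODES = {"transcribe", "verbatim", "translit", "codemix"}
STTT_MODES = {"translate"}


def plan_request(mode: str, language_code: Optional[str] = None) -> dict:
    """Map a saaras:v3 mode to the method name and connection arguments to use."""
    if mode in STT_MODES:
        if not language_code:
            raise ValueError("STT modes need a language_code, e.g. 'hi-IN'")
        return {
            "method": "transcribe",
            "connect_kwargs": {"model": "saaras:v3", "mode": mode,
                               "language_code": language_code},
        }
    if mode in STTT_MODES:
        # Per the table, the source language is auto-detected for translation.
        return {
            "method": "translate",
            "connect_kwargs": {"model": "saaras:v3", "mode": mode},
        }
    raise ValueError(f"Unknown mode: {mode!r}")


stt_plan = plan_request("codemix", language_code="hi-IN")
sttt_plan = plan_request("translate")
```

Failing early on an unknown mode or a missing language code keeps configuration mistakes out of the socket layer, where they are harder to diagnose.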

Best Practices

Audio Quality & Sample Rate:
- Use 16kHz sample rate for best results
- For 8kHz audio, always set sample_rate=8000 in both connection and transcribe/translate calls
- Ensure both sample rate parameters match your actual audio sample rate
Silence Handling:
- Use 1 second silence when high_vad_sensitivity=false
- Use 0.5 seconds silence when high_vad_sensitivity=true
Continuous Streaming: Send audio data continuously for real-time results
Error Handling: Always implement proper WebSocket error handling
Model Selection:
- Use Saaras (saaras:v3) with the mode parameter for the best transcription quality and flexible output modes
- Use Saarika (saarika:v2.5) for transcription in the original language (legacy)
- Use Saaras (saaras:v2.5) for direct translation to English (legacy)