Streaming Speech-to-Text API

Overview

Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that need immediate speech processing with minimal delay.

For complete API reference documentation, see the Speech-to-Text API Reference section.

Model Availability: The Streaming API supports Saaras v3 (recommended), which offers multiple output modes via the mode parameter. The legacy models Saarika v2.5 and Saaras v2.5 remain available, but we recommend switching to Saaras v3 for the best accuracy and features.

Supported Modes (Saaras v3)

| Mode | Description | Output |
|---|---|---|
| transcribe | Standard transcription in the original language | Text in source language |
| translate | Transcribe and translate to English | English text |
| verbatim | Word-for-word transcription including filler words and repetitions | Verbatim text in source language |
| translit | Transcribe and transliterate to Roman script | Romanized text |
| codemix | Transcribe code-mixed speech (e.g., Hindi-English) naturally | Code-mixed text |
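
As a minimal sketch of how the table above might be enforced before opening a connection, the helper below validates a mode against the five listed values and rejects a mode on a legacy model. The function name connection_options is hypothetical and not part of the SDK; only the model and mode strings come from this page.

```python
from typing import Optional

# Modes listed in the table above; per this page they apply to saaras:v3 only.
SAARAS_V3_MODES = {"transcribe", "translate", "verbatim", "translit", "codemix"}


def connection_options(model: str, mode: Optional[str] = None) -> dict:
    """Build the model/mode keyword arguments for connect() (hypothetical helper)."""
    options = {"model": model}
    if mode is None:
        return options
    if model != "saaras:v3":
        raise ValueError(f"mode is only supported by saaras:v3, got model {model!r}")
    if mode not in SAARAS_V3_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SAARAS_V3_MODES)}")
    options["mode"] = mode
    return options
```

The returned dictionary can be unpacked into the connect call, for example connect(**connection_options("saaras:v3", "translit"), language_code="hi-IN").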

Key Benefits

Ultra-Low Latency

Get transcription results in milliseconds, not seconds. Process speech as it happens with near-instantaneous responses.

Multi-Language Support

Support for 10+ Indian languages plus English, with high-accuracy transcription and translation.

Advanced Voice Detection

Smart Voice Activity Detection (VAD) with customizable sensitivity for optimal speech boundary detection.

Common Use Cases

  • Live Transcription: Real-time captions for meetings, webinars, and broadcasts
  • Voice Assistants: Interactive voice applications with immediate responses
  • Call Centers: Live call transcription and analysis
  • Accessibility: Real-time captioning for hearing-impaired users

Audio Format Support: Streaming APIs only support two audio formats:

  • WAV (wav)
  • Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

Other formats like MP3, AAC, OGG, etc. are not supported for WebSocket streaming. Find sample audio files in our GitHub cookbook.
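
One way to catch an unsupported file before streaming it is to inspect its first bytes. The sketch below is an illustration, not part of the SDK: sniff_container is a hypothetical helper that recognizes the RIFF/WAVE header of a WAV file and two common unsupported containers. Raw PCM has no header, so it cannot be identified this way and is reported as "unknown".

```python
def sniff_container(data: bytes) -> str:
    """Guess the container of an audio buffer from its magic bytes (hypothetical helper)."""
    # A WAV file starts with "RIFF", a 4-byte size, then "WAVE".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # Ogg pages start with the capture pattern "OggS".
    if data[:4] == b"OggS":
        return "ogg"
    # MP3 files that carry an ID3v2 tag start with "ID3".
    if data[:3] == b"ID3":
        return "mp3"
    # Headerless data (including raw PCM) cannot be identified from its bytes.
    return "unknown"
```

A caller could refuse to stream anything reported as "ogg" or "mp3" and convert it to WAV or raw PCM first.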

Getting Started

Get up and running with streaming in minutes. Simply change the mode parameter to switch between transcription, translation, and other output formats.

Choosing a Mode

Each mode uses the same connection call with a different mode value (transcribe, translate, verbatim, translit, or codemix). For example, to transcribe audio in the original language:

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",  # Standard transcription
        language_code="en-IN",
        high_vad_sensitivity=True
    ) as ws:
        await ws.transcribe(audio=audio_data)
        response = await ws.recv()
        print(f"Transcription: {response}")

Full Example

Here’s a complete working example. Change the mode parameter to switch between any of the supported modes:

    import asyncio
    import base64
    from sarvamai import AsyncSarvamAI

    # Load your audio file and base64-encode it for the streaming API
    with open("path/to/your/audio.wav", "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")

    async def basic_transcription():
        # Initialize client with your API key
        client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

        # Connect and transcribe; change mode as needed
        async with client.speech_to_text_streaming.connect(
            model="saaras:v3",
            mode="transcribe",
            language_code="en-IN",
            high_vad_sensitivity=True
        ) as ws:
            await ws.transcribe(audio=audio_data)
            response = await ws.recv()
            print(f"Result: {response}")

    asyncio.run(basic_transcription())

Enhanced Processing with Voice Detection

Add smart voice activity detection for better accuracy and control:

    import asyncio
    import base64
    from sarvamai import AsyncSarvamAI

    with open("path/to/your/audio.wav", "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")

    async def enhanced_transcription():
        client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

        async with client.speech_to_text_streaming.connect(
            model="saaras:v3",
            mode="transcribe",  # Change mode as needed
            language_code="hi-IN",
            high_vad_sensitivity=True,  # Better voice detection
            vad_signals=True  # Get speech start/end signals
        ) as ws:
            await ws.transcribe(
                audio=audio_data,
                encoding="audio/wav",
                sample_rate=16000
            )

            async for message in ws:
                if message.type == "events":
                    # VAD signals arrive as events (signal_type is START_SPEECH / END_SPEECH)
                    print(f"Voice activity: {message.data.signal_type}")
                elif message.type == "data":
                    print(f"Result: {message.data.transcript}")
                    break

    asyncio.run(enhanced_transcription())

Instant Processing with Flush Signals

Force immediate processing without waiting for silence detection:

    import asyncio
    import base64
    from sarvamai import AsyncSarvamAI

    with open("path/to/your/audio.wav", "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")

    async def instant_processing():
        client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

        async with client.speech_to_text_streaming.connect(
            model="saaras:v3",
            mode="transcribe",  # Change mode as needed
            language_code="en-IN",
            flush_signal=True  # Enable manual control
        ) as ws:
            await ws.transcribe(
                audio=audio_data,
                encoding="audio/wav",
                sample_rate=16000
            )

            # Force immediate processing
            await ws.flush()

            async for message in ws:
                print(f"Result: {message}")
                break

    asyncio.run(instant_processing())

Custom Audio Configuration

Optimize for your specific audio setup (e.g., 8kHz telephony audio):

    import asyncio
    import base64
    from sarvamai import AsyncSarvamAI

    with open("path/to/your/audio.wav", "rb") as f:
        audio_data = base64.b64encode(f.read()).decode("utf-8")

    async def custom_audio_config():
        client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

        async with client.speech_to_text_streaming.connect(
            model="saaras:v3",
            mode="transcribe",  # Change mode as needed
            language_code="kn-IN",
            sample_rate=8000,  # Match your audio
            input_audio_codec="pcm_s16le",  # Specify codec
            high_vad_sensitivity=True
        ) as ws:
            await ws.transcribe(
                audio=audio_data,
                encoding="audio/wav",
                sample_rate=8000  # Must match connection setting
            )

            response = await ws.recv()
            print(f"Result: {response}")

    asyncio.run(custom_audio_config())

Important: Sample Rate Configuration for 8kHz Audio

When working with 8kHz audio, you must set the sample_rate parameter in both places:

  1. When connecting to the WebSocket (connection parameter)
  2. When sending audio data (transcribe parameter)

Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",
        language_code="en-IN",
        sample_rate=8000  # Must match your audio
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            sample_rate=8000  # Must match connection setting
        )
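
To avoid hard-coding a rate that may not match the file, the sample rate can be read from the WAV header and reused in both places. This is a minimal sketch using Python's standard wave module; the helper names wav_sample_rate and matching_sample_rate_kwargs are hypothetical and not part of the SDK.

```python
import wave


def wav_sample_rate(path_or_file) -> int:
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path_or_file, "rb") as wav_file:
        return wav_file.getframerate()


def matching_sample_rate_kwargs(path_or_file):
    """Build identical sample_rate arguments for connect() and transcribe()."""
    rate = wav_sample_rate(path_or_file)
    connect_kwargs = {"sample_rate": rate}     # connection parameter
    transcribe_kwargs = {"sample_rate": rate}  # per-message parameter
    return connect_kwargs, transcribe_kwargs
```

Both dictionaries are derived from the same header value, so the two settings cannot drift apart.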

For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket

Handling Disconnects

Long-lived sockets will occasionally drop (network blips, idle timeouts, server restarts). Inspect the WebSocket close code and reconnect with backoff.

| Close code | Meaning | What to do |
|---|---|---|
| 1000 | Normal closure | You called close(); nothing to do |
| 1001 | Going away | Server or client shutting down; reconnect |
| 1006 | Abnormal closure (no close frame) | Network drop; reconnect with backoff |
| 1011 | Server error | Retry with backoff; if persistent, check status |
| 4xxx | Application-specific | Read the close reason for details (e.g. auth or quota); fix before reconnecting |

Codes 1000–1015 are standard WebSocket codes. Any 4000–4999 code is application-specific — always read the accompanying close reason string rather than assuming a fixed meaning.
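
The table can be turned into a small decision function. The sketch below is illustrative only: CloseAction and action_for_close are hypothetical names, and treating every close code not listed in the table as reconnectable is an assumption rather than documented behavior.

```python
from enum import Enum


class CloseAction(Enum):
    NONE = "none"            # normal closure, nothing to do
    RECONNECT = "reconnect"  # reconnect, with backoff
    STOP = "stop"            # read the close reason and fix before reconnecting


def action_for_close(code: int) -> CloseAction:
    """Map a WebSocket close code to an action, following the table above."""
    if code == 1000:
        return CloseAction.NONE
    if 4000 <= code <= 4999:
        # Application-specific: do not retry blindly.
        return CloseAction.STOP
    # 1001, 1006, 1011; other codes are treated the same way (assumption).
    return CloseAction.RECONNECT
```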

Reconnect with exponential backoff (pseudocode — applies to both SDKs):

    attempt = 0
    while not connected and attempt < MAX_ATTEMPTS:
        try:
            open WebSocket and resume streaming
            attempt = 0  # reset on success
        except (close 1006 / 1011 / network error):
            delay = min(BASE * 2 ** attempt, MAX_DELAY)  # e.g. 0.5s, 1s, 2s, 4s ... capped
            sleep(delay + small random jitter)
            attempt += 1
        on close 4xxx (auth/quota): stop and surface the error  # do not blind-retry

Do not auto-retry on 4xxx auth/quota closes — fix the underlying issue first (see Errors & Troubleshooting).
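
The backoff loop could look like this in Python. This is a sketch under stated assumptions: stream_once is a hypothetical coroutine function supplied by the caller that opens the socket and streams until it closes, retryable drops are assumed to surface as the built-in ConnectionError, and the constants are example values. Any other exception (for example one raised for a 4xxx close) propagates without a retry.

```python
import asyncio
import random

BASE_DELAY = 0.5   # seconds (example value)
MAX_DELAY = 8.0    # cap on the backoff delay (example value)
MAX_ATTEMPTS = 5   # example value


def backoff_delay(attempt: int, base: float = BASE_DELAY, cap: float = MAX_DELAY) -> float:
    """Exponential backoff without jitter: 0.5s, 1s, 2s, 4s, ... capped."""
    return min(base * 2 ** attempt, cap)


async def run_with_reconnect(stream_once, max_attempts=MAX_ATTEMPTS, sleep=asyncio.sleep):
    """Call stream_once(), retrying with backoff on ConnectionError only."""
    failures = 0
    while True:
        try:
            return await stream_once()
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise  # give up and surface the last error
            # Small random jitter so many clients do not reconnect in lockstep.
            await sleep(backoff_delay(failures - 1) + random.uniform(0, 0.1))
```

Passing sleep as a parameter keeps the loop easy to test without real waiting.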

Voice-Agent Barge-In

In a voice agent, the user may start speaking while your TTS reply is still playing (“barge-in”). Use vad_signals=true and treat the START_SPEECH event as the cue to stop playback immediately and let the user take the turn.

    async with client.speech_to_text_streaming.connect(
        model="saaras:v3",
        mode="transcribe",
        language_code="hi-IN",
        high_vad_sensitivity=True,  # 0.5s silence boundary; snappier for conversation
        vad_signals=True,  # emit START_SPEECH / END_SPEECH events
    ) as ws:
        await ws.transcribe(audio=mic_chunk, encoding="audio/wav", sample_rate=16000)

        async for message in ws:
            if message.type == "events" and message.data.signal_type == "START_SPEECH":
                tts_player.stop()  # barge-in: cut off the agent's current reply
            elif message.type == "data":
                handle_user_turn(message.data.transcript)

For conversational use, prefer high_vad_sensitivity=True (0.5s silence boundary) so the agent reacts quickly. See the LiveKit and Pipecat voice-agent integration guides for full agent setups, and Credits & Rate Limits for concurrency limits on streaming connections.
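
The barge-in logic can also be kept separate from the socket loop so that it is testable on its own. The sketch below assumes messages shaped like the ones in the example above (type, data.signal_type, data.transcript). BargeInController, its tts_player argument (any object with a stop() method), and the on_transcript callback are hypothetical names, not part of the SDK.

```python
class BargeInController:
    """Stop TTS playback when the user starts speaking over the agent."""

    def __init__(self, tts_player, on_transcript):
        self.tts_player = tts_player        # any object with a stop() method
        self.on_transcript = on_transcript  # called with each final transcript
        self.agent_speaking = False

    def agent_started_speaking(self):
        """Call this when TTS playback begins."""
        self.agent_speaking = True

    def handle(self, message):
        """Process one streaming message."""
        if message.type == "events":
            if message.data.signal_type == "START_SPEECH" and self.agent_speaking:
                self.tts_player.stop()  # barge-in: cut off the current reply
                self.agent_speaking = False
        elif message.type == "data":
            self.on_transcript(message.data.transcript)
```

Inside the streaming loop, each incoming message would simply be passed to controller.handle(message).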

API Reference

Connection Parameters

Configure your WebSocket connection with these parameters:

| Parameter | Type | Description | Example |
|---|---|---|---|
| language_code | string | Language for speech recognition (STT only) | "en-IN", "hi-IN", "kn-IN" |
| model | string | Model version to use | "saaras:v3" (recommended), "saarika:v2.5" (legacy), "saaras:v2.5" (legacy) |
| mode | string | Output mode (saaras:v3 only): transcribe, translate, verbatim, translit, codemix | "transcribe" |
| sample_rate | integer | Audio sample rate in Hz | 8000, 16000 |
| input_audio_codec | string | Audio codec format. Only wav and raw PCM formats (pcm_s16le, pcm_l16, pcm_raw) are supported | "wav", "pcm_s16le" |
| high_vad_sensitivity | boolean | Enhanced voice activity detection | true, false |
| vad_signals | boolean | Receive speech start/end events | true, false |
| flush_signal | boolean | Enable manual buffer flushing | true, false |

Audio Data Parameters

When sending audio data to the streaming endpoint:

| Parameter | Type | Description | Required |
|---|---|---|---|
| audio | string | Base64-encoded audio data | ✅ |
| encoding | string | Audio format | ✅ |
| sample_rate | integer | Audio sample rate (16000 Hz recommended). Must match the connection parameter | ✅ |
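
As an illustration of the three parameters above, the helper below base64-encodes a chunk of audio bytes and bundles it with its encoding and sample rate. The name audio_payload is hypothetical; the keys mirror the keyword arguments used in the transcribe() examples on this page, and the defaults are the values those examples use.

```python
import base64


def audio_payload(chunk: bytes, encoding: str = "audio/wav", sample_rate: int = 16000) -> dict:
    """Bundle one audio chunk with the parameters transcribe() expects."""
    return {
        "audio": base64.b64encode(chunk).decode("utf-8"),  # base64 text, not raw bytes
        "encoding": encoding,
        "sample_rate": sample_rate,  # must match the connection's sample_rate
    }
```

The result can be unpacked into the call, for example await ws.transcribe(**audio_payload(chunk, sample_rate=8000)).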

Response Types

When vad_signals=true, you’ll receive different message types:

For STT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • transcript: Final transcription result

For STTT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • translation: Final translation result

Key Differences: STT vs STTT

| Aspect | STT | STTT |
|---|---|---|
| Model | saaras:v3 (recommended), saarika:v2.5 (legacy) | saaras:v3 (recommended), saaras:v2.5 (legacy) |
| Method | transcribe() | translate() |
| Mode | transcribe, verbatim, translit, codemix (saaras:v3 only) | translate (saaras:v3 only) |
| Language Code | Required | Not required (auto-detected) |
| Output Language | Same as input | English only |

Best Practices

  • Audio Quality & Sample Rate:
    • Use 16kHz sample rate for best results
    • For 8kHz audio, always set sample_rate=8000 in both connection and transcribe/translate calls
    • Ensure both sample rate parameters match your actual audio sample rate
  • Silence Handling:
    • Use 1 second silence when high_vad_sensitivity=false
    • Use 0.5 seconds silence when high_vad_sensitivity=true
  • Continuous Streaming: Send audio data continuously for real-time results
  • Error Handling: Always implement proper WebSocket error handling
  • Model Selection:
    • Use Saaras (saaras:v3) with mode parameter for the best transcription quality and flexible output modes
    • Use Saarika (saarika:v2.5) for transcription in the original language (legacy)
    • Use Saaras (saaras:v2.5) for direct translation to English (legacy)
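
For continuous streaming, audio is usually sent in small fixed-duration chunks rather than as one large buffer. The sketch below splits raw 16-bit mono PCM into chunks of a given duration. The helper name pcm_chunks and the 100 ms default are assumptions for illustration, not documented requirements of the API.

```python
def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100,
               sample_width: int = 2, channels: int = 1):
    """Yield fixed-duration slices of raw PCM audio (the last one may be shorter)."""
    # Bytes per chunk = samples per second * bytes per sample * channels * seconds.
    bytes_per_chunk = sample_rate * sample_width * channels * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]
```

At 16 kHz, 16-bit mono, a 100 ms chunk is 3200 bytes; at 8 kHz it is 1600 bytes. Each chunk would then be base64-encoded and sent with the same sample_rate as the connection.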