Streaming Speech-to-Text API

Overview

Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that require immediate speech processing with minimal delay.

Key Benefits

Ultra-Low Latency

Get transcription results in milliseconds, not seconds. Process speech as it happens with near-instantaneous responses.

Multi-Language Support

Support for 10+ Indian languages plus English with high accuracy transcription and translation capabilities.

Advanced Voice Detection

Smart Voice Activity Detection (VAD) with customizable sensitivity for optimal speech boundary detection.

Common Use Cases

  • Live Transcription: Real-time captions for meetings, webinars, and broadcasts
  • Voice Assistants: Interactive voice applications with immediate responses
  • Call Centers: Live call transcription and analysis
  • Accessibility: Real-time captioning for hearing-impaired users

Audio Format Support: Streaming APIs only support two audio formats:

  • WAV (wav)
  • Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

Other formats, such as MP3, AAC, and OGG, are not supported for WebSocket streaming. Find sample audio files in our GitHub cookbook.
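Before streaming, it can help to confirm your file actually is a supported format at the expected sample rate. A minimal sketch using Python's standard-library wave module (the check_wav helper name is ours, not part of the SDK):

```python
import wave

def check_wav(path: str) -> None:
    # Inspect the properties the streaming API cares about
    with wave.open(path, "rb") as wf:
        print(f"Channels:     {wf.getnchannels()}")
        print(f"Sample rate:  {wf.getframerate()} Hz")
        print(f"Sample width: {wf.getsampwidth() * 8}-bit")

check_wav("path/to/your/audio.wav")

# One way to convert an unsupported format (assumes ffmpeg is installed):
#   ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```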

Getting Started

Get up and running with streaming in minutes. Choose between Speech-to-Text (STT) for transcription or Speech-to-Text Translation (STTT) for direct translation.

Basic Usage

The simplest way to get started with real-time processing:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def basic_transcription():
    # Initialize client with your API key
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # Connect and transcribe
    async with client.speech_to_text_streaming.connect(
        language_code="en-IN", high_vad_sensitivity=True
    ) as ws:
        # Send audio for transcription
        await ws.transcribe(audio=audio_data)

        # Get the result
        response = await ws.recv()
        print(f"Transcription: {response}")

# Run the transcription
asyncio.run(basic_transcription())
```

Enhanced Processing with Voice Detection

Add smart voice activity detection for better accuracy and control:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def enhanced_transcription():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",      # Hindi (India)
        model="saarika:v2.5",       # Latest model
        high_vad_sensitivity=True,  # Better voice detection
        vad_signals=True            # Get speech start/end signals
    ) as ws:
        # Send audio
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Handle multiple response types
        async for message in ws:
            if message.get("type") == "speech_start":
                print("🎤 Speech detected")
            elif message.get("type") == "speech_end":
                print("🔇 Speech ended")
            elif message.get("type") == "transcript":
                print(f"📝 Result: {message.get('text')}")
                break  # Got our transcription

# Run the enhanced transcription
asyncio.run(enhanced_transcription())
```

Instant Processing with Flush Signals

Force immediate processing without waiting for silence detection:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def instant_processing():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="en-IN",
        model="saarika:v2.5",
        flush_signal=True  # Enable manual control
    ) as ws:
        # Send audio
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Force immediate processing
        await ws.flush()
        print("⚡ Processing forced - getting immediate results")

        # Get results
        async for message in ws:
            print(f"Result: {message}")
            break

# Run instant processing
asyncio.run(instant_processing())
```

Custom Audio Configuration

Optimize for your specific audio setup:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def custom_audio_config():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="kn-IN",
        model="saarika:v2.5",
        sample_rate=8000,               # Match your audio
        input_audio_codec="pcm_s16le",  # Specify codec (wav, pcm_s16le, pcm_l16, pcm_raw)
        high_vad_sensitivity=True       # For noisy environments
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=8000  # Must match connection setting
        )

        response = await ws.recv()
        print(f"Optimized result: {response}")

# Run custom audio configuration
asyncio.run(custom_audio_config())
```

Important: Sample Rate Configuration for 8kHz Audio

When working with 8kHz audio, you must set the sample_rate parameter in both places:

  1. When connecting to the WebSocket (connection parameter)
  2. When sending audio data (transcribe/translate parameter)

Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.

Example for STT:

```python
# Set sample_rate when connecting
async with client.speech_to_text_streaming.connect(
    language_code="en-IN",
    sample_rate=8000  # Must match your audio
) as ws:
    # Set sample_rate when sending audio
    await ws.transcribe(
        audio=audio_data,
        sample_rate=8000  # Must match connection setting
    )
```

Example for STTT:

```python
# Set sample_rate when connecting
async with client.speech_to_text_translate_streaming.connect(
    model="saaras:v2.5",
    sample_rate=8000  # Must match your audio
) as ws:
    # Set sample_rate when sending audio
    await ws.translate(
        audio=audio_data,
        sample_rate=8000  # Must match connection setting
    )
```

For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket

API Reference

Connection Parameters

Configure your WebSocket connection with these parameters:

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| language_code | string | Language for speech recognition (STT only) | "en-IN", "hi-IN", "kn-IN" |
| model | string | Model version to use | "saarika:v2.5" (STT), "saaras:v2.5" (STTT) |
| sample_rate | integer | Audio sample rate in Hz | 8000, 16000 |
| input_audio_codec | string | Audio codec format. Only wav and raw PCM formats (pcm_s16le, pcm_l16, pcm_raw) are supported | "wav", "pcm_s16le" |
| high_vad_sensitivity | boolean | Enhanced voice activity detection | true, false |
| vad_signals | boolean | Receive speech start/end events | true, false |
| flush_signal | boolean | Enable manual buffer flushing | true, false |
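For quick reference, here is a sketch that sets every connection parameter in one call. Combining all of them at once is an assumption on our part; the Getting Started examples above each use a subset, and the snippet follows their fragment style (client is initialized as shown there):

```python
# client = AsyncSarvamAI(api_subscription_key="your-api-key"), as above
async with client.speech_to_text_streaming.connect(
    language_code="en-IN",          # Target language (STT only)
    model="saarika:v2.5",           # Model version
    sample_rate=16000,              # Must match your audio
    input_audio_codec="pcm_s16le",  # wav or raw PCM formats only
    high_vad_sensitivity=True,      # Enhanced voice activity detection
    vad_signals=True,               # Emit speech start/end events
    flush_signal=True               # Allow manual buffer flushing
) as ws:
    ...
```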

Audio Data Parameters

When sending audio data to the streaming endpoint:

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| audio | string | Base64-encoded audio data | Yes |
| encoding | string | Audio format | No |
| sample_rate | integer | Audio sample rate (16000 Hz recommended). Must match the connection parameter | No |

Response Types

When vad_signals=true, you’ll receive different message types:

For STT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • transcript: Final transcription result

For STTT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • translation: Final translation result

Key Differences: STT vs STTT

| Aspect | STT | STTT |
| --- | --- | --- |
| Model | saarika:v2.5 | saaras:v2.5 |
| Method | transcribe() | translate() |
| Language Code | Required | Not required (auto-detected) |
| Output Language | Same as input | English only |
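The guide walks through STT end to end but does not show a complete STTT session. A minimal sketch assembled from the pieces above, assuming the translation message carries its result in a text field the same way the STT transcript message does:

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def basic_translation():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # No language_code: STTT auto-detects the input language
    async with client.speech_to_text_translate_streaming.connect(
        model="saaras:v2.5",
        vad_signals=True
    ) as ws:
        # translate() is the STTT counterpart of transcribe()
        await ws.translate(audio=audio_data)

        async for message in ws:
            if message.get("type") == "translation":
                # 'text' key assumed to mirror the STT transcript message
                print(f"Translation: {message.get('text')}")
                break

asyncio.run(basic_translation())
```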

Best Practices

  • Audio Quality & Sample Rate:
    • Use 16kHz sample rate for best results
    • For 8kHz audio, always set sample_rate=8000 in both connection and transcribe/translate calls
    • Ensure both sample rate parameters match your actual audio sample rate
  • Silence Handling:
    • Allow 1 second of silence when high_vad_sensitivity=false
    • Allow 0.5 seconds of silence when high_vad_sensitivity=true
  • Continuous Streaming: Send audio data continuously for real-time results
  • Error Handling: Always implement proper WebSocket error handling (a combined sketch follows this list)
  • Model Selection:
    • Use Saarika (saarika:v2.5) for transcription in the original language
    • Use Saaras (saaras:v2.5) for direct translation to English
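A sketch tying these practices together: streaming raw PCM in small chunks, padding trailing silence so VAD can finalize, and wrapping the session in basic error handling. The repeated transcribe() calls for chunking, the 100 ms chunk size, and the exception types caught are our assumptions; the sarvamai SDK may expose its own chunking helpers and exception classes.

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI

SAMPLE_RATE = 16000   # 16 kHz recommended
BYTES_PER_SAMPLE = 2  # pcm_s16le is 16-bit
CHUNK_MS = 100        # ~100 ms of audio per message (assumed reasonable)

async def stream_file(path: str) -> None:
    client = AsyncSarvamAI(api_subscription_key="your-api-key")
    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

    with open(path, "rb") as f:
        raw = f.read()

    # Pad 0.5 s of trailing silence (zero bytes in 16-bit PCM) so VAD can
    # detect end of speech when high_vad_sensitivity=True
    raw += b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE // 2)

    try:
        async with client.speech_to_text_streaming.connect(
            language_code="en-IN",
            model="saarika:v2.5",
            sample_rate=SAMPLE_RATE,
            input_audio_codec="pcm_s16le",
            high_vad_sensitivity=True
        ) as ws:
            # Send audio continuously in small chunks
            for i in range(0, len(raw), chunk_size):
                chunk = base64.b64encode(raw[i:i + chunk_size]).decode("utf-8")
                await ws.transcribe(audio=chunk, sample_rate=SAMPLE_RATE)

            response = await ws.recv()
            print(f"Transcription: {response}")
    except (ConnectionError, OSError) as e:
        # Surface WebSocket/network failures instead of failing silently
        print(f"Streaming failed: {e}")

asyncio.run(stream_file("path/to/your/audio.raw"))
```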