Streaming Speech-to-Text API

Overview

Transform audio into text in real time with our WebSocket-based streaming API. Built for applications that require immediate speech processing with minimal delay.

Key Benefits

Ultra-Low Latency

Get transcription results in milliseconds, not seconds. Process speech as it happens with near-instantaneous responses.

Multi-Language Support

Support for 10+ Indian languages plus English, with high-accuracy transcription and translation capabilities.

Advanced Voice Detection

Smart Voice Activity Detection (VAD) with customizable sensitivity for optimal speech boundary detection.

Common Use Cases

  • Live Transcription: Real-time captions for meetings, webinars, and broadcasts
  • Voice Assistants: Interactive voice applications with immediate responses
  • Call Centers: Live call transcription and analysis
  • Accessibility: Real-time captioning for hearing-impaired users

Audio Format Support: Streaming APIs support .wav and raw PCM formats only. Find sample audio files in our GitHub cookbook.
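Before streaming a WAV file, it can help to confirm it actually has the sample rate and channel layout you expect. Here is a minimal check using Python's standard-library wave module (the file path is a placeholder):

import wave

# Inspect a WAV file's format before streaming it
with wave.open("path/to/your/audio.wav", "rb") as wav:
    print(f"Sample rate:  {wav.getframerate()} Hz")        # e.g. 8000 or 16000
    print(f"Channels:     {wav.getnchannels()}")           # mono is typical for STT
    print(f"Sample width: {wav.getsampwidth() * 8} bits")  # 16-bit PCM is common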

Getting Started

Get up and running with streaming in minutes. Choose between Speech-to-Text (STT) for transcription or Speech-to-Text Translation (STTT) for direct translation.

Basic Usage

The simplest way to get started with real-time processing:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def basic_transcription():
    # Initialize client with your API key
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # Connect and transcribe
    async with client.speech_to_text_streaming.connect(
        language_code="en-IN", high_vad_sensitivity=True
    ) as ws:
        # Send audio for transcription
        await ws.transcribe(audio=audio_data)

        # Get the result
        response = await ws.recv()
        print(f"Transcription: {response}")

# Run the transcription
asyncio.run(basic_transcription())

Enhanced Processing with Voice Detection

Add smart voice activity detection for better accuracy and control:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def enhanced_transcription():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",      # Hindi (India)
        model="saarika:v2.5",       # Latest model
        high_vad_sensitivity=True,  # Better voice detection
        vad_signals=True            # Get speech start/end signals
    ) as ws:
        # Send audio
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Handle multiple response types
        async for message in ws:
            if message.get("type") == "speech_start":
                print("🎤 Speech detected")
            elif message.get("type") == "speech_end":
                print("🔇 Speech ended")
            elif message.get("type") == "transcript":
                print(f"📝 Result: {message.get('text')}")
                break  # Got our transcription

# Run the enhanced transcription
asyncio.run(enhanced_transcription())

Instant Processing with Flush Signals

Force immediate processing without waiting for silence detection:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def instant_processing():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="en-IN",
        model="saarika:v2.5",
        flush_signal=True  # Enable manual control
    ) as ws:
        # Send audio
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Force immediate processing
        await ws.flush()
        print("⚡ Processing forced - getting immediate results")

        # Get results
        async for message in ws:
            print(f"Result: {message}")
            break

# Run instant processing
asyncio.run(instant_processing())

Custom Audio Configuration

Optimize for your specific audio setup:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def custom_audio_config():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="kn-IN",
        model="saarika:v2.5",
        sample_rate=8000,          # Match your audio
        input_audio_codec="pcm",   # Specify codec
        high_vad_sensitivity=True  # For noisy environments
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=8000  # Must match connection setting
        )

        response = await ws.recv()
        print(f"Optimized result: {response}")

# Run custom audio configuration
asyncio.run(custom_audio_config())

Important: Sample Rate Configuration for 8kHz Audio

When working with 8kHz audio, you must set the sample_rate parameter in both places:

  1. When connecting to the WebSocket (connection parameter)
  2. When sending audio data (transcribe/translate parameter)

Both values must match your audio’s actual sample rate. Mismatched sample rates will result in poor transcription quality or errors.

Example for STT:

# Set sample_rate when connecting
async with client.speech_to_text_streaming.connect(
    language_code="en-IN",
    sample_rate=8000  # Must match your audio
) as ws:
    # Set sample_rate when sending audio
    await ws.transcribe(
        audio=audio_data,
        sample_rate=8000  # Must match connection setting
    )

Example for STTT:

# Set sample_rate when connecting
async with client.speech_to_text_translate_streaming.connect(
    model="saaras:v2.5",
    sample_rate=8000  # Must match your audio
) as ws:
    # Set sample_rate when sending audio
    await ws.translate(
        audio=audio_data,
        sample_rate=8000  # Must match connection setting
    )

For detailed endpoint documentation, see: Speech-to-Text WebSocket | Speech-to-Text Translate WebSocket

API Reference

Connection Parameters

Configure your WebSocket connection with these parameters:

Parameter            | Type    | Description                                                              | Example
language_code        | string  | Language for speech recognition (STT only)                               | "en-IN", "hi-IN", "kn-IN"
model                | string  | Model version to use                                                     | "saarika:v2.5" (STT), "saaras:v2.5" (STTT)
sample_rate          | integer | Audio sample rate in Hz; must match the sample rate in audio data calls | 8000, 16000
input_audio_codec    | string  | Audio codec format                                                       | "wav", "pcm"
high_vad_sensitivity | boolean | Enhanced voice activity detection                                        | true, false
vad_signals          | boolean | Receive speech start/end events                                          | true, false
flush_signal         | boolean | Enable manual buffer flushing                                            | true, false

Audio Data Parameters

When sending audio data to the streaming endpoint:

Parameter   | Type    | Description                                                                    | Required
audio       | string  | Base64-encoded audio data                                                      | Yes
encoding    | string  | Audio format                                                                   |
sample_rate | integer | Audio sample rate (16000 Hz recommended); must match the connection parameter |

Response Types

When vad_signals=true, you’ll receive different message types:

For STT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • transcript: Final transcription result

For STTT:

  • speech_start: Voice activity detected
  • speech_end: Voice activity stopped
  • translation: Final translation result

Key Differences: STT vs STTT

Aspect          | STT           | STTT
Model           | saarika:v2.5  | saaras:v2.5
Method          | transcribe()  | translate()
Language Code   | Required      | Not required (auto-detected)
Output Language | Same as input | English only
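
Putting those differences together, here is a minimal end-to-end STTT sketch mirroring the Basic Usage example above. It assumes the translate streaming client follows the same connect/recv flow shown in the sample-rate snippets earlier in this guide; the file path and API key are placeholders:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load your audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def basic_translation():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # No language_code: STTT auto-detects the input language
    async with client.speech_to_text_translate_streaming.connect(
        model="saaras:v2.5",
        sample_rate=16000
    ) as ws:
        # translate() instead of transcribe()
        await ws.translate(audio=audio_data)

        # Output is English text
        response = await ws.recv()
        print(f"Translation: {response}")

asyncio.run(basic_translation())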

Best Practices

  • Audio Quality & Sample Rate:
    • Use 16kHz sample rate for best results
    • For 8kHz audio, always set sample_rate=8000 in both connection and transcribe/translate calls
    • Ensure both sample rate parameters match your actual audio sample rate
  • Silence Handling:
    • Use 1 second of silence when high_vad_sensitivity=false
    • Use 0.5 seconds of silence when high_vad_sensitivity=true
  • Continuous Streaming: Send audio data continuously for real-time results (see the chunked streaming sketch after this list)
  • Error Handling: Always implement proper WebSocket error handling; the sketch below wraps the session in a try/except
  • Model Selection:
    • Use Saarika (saarika:v2.5) for transcription in the original language
    • Use Saaras (saaras:v2.5) for direct translation to English
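
To tie several of these practices together, here is a sketch of chunked streaming with basic error handling. The chunk size, the per-chunk transcribe calls, the end-of-stream flush, and the catch-all exception handler are illustrative assumptions, not behavior prescribed by the SDK:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

CHUNK_SIZE = 32000  # bytes; about 1 second of 16 kHz 16-bit mono audio (illustrative)

async def stream_in_chunks():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    try:
        async with client.speech_to_text_streaming.connect(
            language_code="en-IN",
            model="saarika:v2.5",
            sample_rate=16000,
            flush_signal=True
        ) as ws:
            # Send the audio continuously in small chunks
            with open("path/to/your/audio.wav", "rb") as f:
                while chunk := f.read(CHUNK_SIZE):
                    await ws.transcribe(
                        audio=base64.b64encode(chunk).decode("utf-8"),
                        encoding="audio/wav",
                        sample_rate=16000
                    )

            # Force processing of any buffered audio at end of stream
            await ws.flush()

            async for message in ws:
                print(f"Result: {message}")
                break
    except Exception as e:
        # Catch-all shown for brevity; handle connection drops,
        # timeouts, and API errors per your application's needs
        print(f"Streaming failed: {e}")

asyncio.run(stream_in_chunks())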