Streaming Speech-to-Text API
Real-time Processing
Process audio streams in real-time with WebSocket connections. Ideal for:
- Live transcription
- Real-time translation
- Interactive applications
- Low-latency requirements
Features
- Live audio transcription
- WebSocket-based streaming
- Low latency responses
- Multiple Indian languages and English support
- Language code specification (e.g., “kn-IN” for Kannada)
- High accuracy transcription
- Python and JavaScript SDKs with async support
- WebSocket connections
- Easy-to-use API interface
- The STT and STTT WebSockets only support .wav and raw PCM audio
Best Practices
- Send a continuous stream of audio data
- Use 1 second of silence when VAD (Voice Activity Detection) sensitivity is FALSE
- Use 0.5 seconds of silence when VAD (Voice Activity Detection) sensitivity is TRUE (see the silence-padding sketch after this list)
- You can send audio of arbitrary length
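The trailing silence matters because the server's VAD uses it to decide that an utterance has ended. As a rough illustration (not part of the official SDK), here is one way to append silence to raw 16-bit mono PCM before streaming it; the 16 kHz rate is only an assumption and should match the sample_rate you actually send:

```python
# Minimal sketch: generate trailing silence for 16 kHz, 16-bit mono PCM.
SAMPLE_RATE = 16000      # Hz; must match the sample_rate you stream with
BYTES_PER_SAMPLE = 2     # 16-bit PCM

def silence(seconds: float) -> bytes:
    """Return `seconds` of digital silence as raw 16-bit mono PCM."""
    return b"\x00" * int(SAMPLE_RATE * seconds * BYTES_PER_SAMPLE)

# Raw PCM recorded at SAMPLE_RATE (path is just a placeholder).
with open("speech.raw", "rb") as f:
    speech_chunk = f.read()

# high_vad_sensitivity=False -> append 1 second of silence;
# high_vad_sensitivity=True  -> 0.5 seconds is enough.
audio_to_send = speech_chunk + silence(1.0)
```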
You can find sample audio files and their corresponding base64-encoded strings in the GitHub cookbook.
Saarika: Our Speech to Text Transcription Model
Basic Streaming Transcription
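Complete, up-to-date Python and JavaScript examples are available in the GitHub cookbook. As a rough orientation only, the sketch below uses the `websockets` package directly; the endpoint URL, the `api-subscription-key` header name, and the JSON message fields are assumptions and may differ from the actual wire format, so treat the cookbook as the source of truth.

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

API_KEY = "YOUR_SARVAM_API_KEY"

# Assumed endpoint and query-parameter spelling; confirm against the cookbook.
URL = (
    "wss://api.sarvam.ai/speech-to-text/ws"
    "?language_code=kn-IN&model=saarika:v2.5&sample_rate=16000"
)

async def transcribe(path: str) -> None:
    async with websockets.connect(
        URL,
        additional_headers={"api-subscription-key": API_KEY},  # assumed header name
        # On websockets < 14, pass extra_headers= instead of additional_headers=.
    ) as ws:
        # Read a .wav (or raw PCM) file -- the only formats the WebSocket accepts --
        # ideally with trailing silence, per the best practices above.
        with open(path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")

        # Assumed message shape; the cookbook shows the authoritative schema.
        await ws.send(json.dumps({
            "audio": {
                "data": audio_b64,
                "encoding": "audio/wav",
                "sample_rate": 16000,
            }
        }))

        # Print whatever the server sends back; the final message contains the
        # transcript (field names may differ -- inspect the payload).
        response = await ws.recv()
        print(json.loads(response))

asyncio.run(transcribe("sample.wav"))
```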
Streaming Guide
Query Parameters
Send Parameters
- `language_code`: Specifies the language for speech recognition (e.g., `en-IN`, `hi-IN`, `kn-IN`, etc.).
- `model`: Selects the speech-to-text model version (e.g., `saarika:v2.5`).
- `high_vad_sensitivity`: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.
- `vad_signals`: When enabled, provides VAD event signals in the response stream:
  - "speech_start": Indicates the beginning of speech detection
  - "speech_end": Indicates the end of speech detection
  - "transcript": Contains the final transcription after speech end
- `sample_rate`: Specifies the sample rate of the input audio in Hz (e.g., `8000`, `16000`, `44100`). Allows for optimal processing based on your audio quality.
New Streaming Features
Sample Rate and Input Audio Codec Support
STT streaming now supports specifying both the sample rate and the input audio codec, so processing can be optimized for your audio.
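A sketch of how these settings might be passed as query parameters when opening the connection. The docs name the feature but not the exact keys, so `input_audio_codec` and its value below are assumptions; verify the real parameter names in the GitHub cookbook.

```python
from urllib.parse import urlencode

# Assumed parameter names and codec identifier -- confirm before relying on them.
params = {
    "language_code": "hi-IN",
    "model": "saarika:v2.5",
    "sample_rate": 8000,               # e.g., telephony-quality audio
    "input_audio_codec": "pcm_s16le",  # assumed codec key and value
}
url = "wss://api.sarvam.ai/speech-to-text/ws?" + urlencode(params)
# ...then connect with websockets.connect(url, ...) exactly as in the
# basic streaming sketch above.
```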
Example Usage
Basic Streaming
Streaming with VAD Signals
When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:
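The sketch below assumes each server message is JSON with a field identifying the signal type; those field names are assumptions, so adapt them to the payloads you actually receive (see the cookbook for the real schema).

```python
import asyncio
import json

async def wait_for_transcript(ws, timeout: float = 30.0) -> str | None:
    """Consume VAD signal messages until the final transcript arrives.

    Assumes messages look like {"type": "speech_start" | "speech_end" |
    "transcript", ...}; adjust field names to the real payloads.
    """
    try:
        while True:
            message = json.loads(await asyncio.wait_for(ws.recv(), timeout))
            kind = message.get("type")
            if kind == "speech_start":
                print("speech detected...")
            elif kind == "speech_end":
                print("speech ended, waiting for transcript...")
            elif kind == "transcript":
                return message.get("transcript")
    except asyncio.TimeoutError:
        print("timed out waiting for the transcript")
        return None
```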
Note: When using `vad_signals=True`, expect a slight delay between receiving the "speech_end" signal and the final transcript. This delay allows the model to process the complete audio segment and generate accurate transcription. The timeout in the example above can be adjusted based on your audio length and requirements.
Saaras: Our Speech to Text Translation Model
Streaming Translation
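The GitHub cookbook has the full Python and JavaScript translation examples. The sketch below only highlights what changes relative to the Saarika sketch above; the translation endpoint path, header name, and message fields are assumptions.

```python
import asyncio
import base64
import json

import websockets

API_KEY = "YOUR_SARVAM_API_KEY"

# Assumed endpoint path for the translation WebSocket; note there is no
# language_code here -- per the parameters below, only the model is selected.
URL = (
    "wss://api.sarvam.ai/speech-to-text-translate/ws"
    "?model=saaras:v2.5&sample_rate=16000"
)

async def translate(path: str) -> None:
    async with websockets.connect(
        URL,
        additional_headers={"api-subscription-key": API_KEY},  # assumed header name
    ) as ws:
        with open(path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")
        # Same assumed message shape as the Saarika sketch above.
        await ws.send(json.dumps({
            "audio": {"data": audio_b64, "encoding": "audio/wav", "sample_rate": 16000}
        }))
        response = json.loads(await ws.recv())
        print(response)  # contains both the transcript and its translation

asyncio.run(translate("sample.wav"))
```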
STT Translation Streaming Guide
Query Parameters
Send Parameters
- `model`: Selects the speech-to-text model version (e.g., `saaras:v2.5`).
- `high_vad_sensitivity`: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.
- `vad_signals`: When enabled, provides VAD event signals in the response stream:
  - "speech_start": Indicates the beginning of speech detection
  - "speech_end": Indicates the end of speech detection
  - "transcript": Contains the transcription and translation after speech end
- `sample_rate`: Specifies the sample rate of the input audio in Hz (e.g., `8000`, `16000`, `44100`). Allows for optimal processing based on your audio quality.
Example Usage
Basic Streaming
Streaming with VAD Signals
When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:
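The message-handling loop is the same as in the Saarika VAD sketch above; the only difference is that the final "transcript" message also carries the translation. The field names below are assumptions, so inspect the real payloads.

```python
import asyncio
import json

async def wait_for_translation(ws, timeout: float = 30.0) -> dict | None:
    """Like wait_for_transcript above, but the final message also carries
    the translation (field names are assumptions)."""
    try:
        while True:
            message = json.loads(await asyncio.wait_for(ws.recv(), timeout))
            kind = message.get("type")
            if kind == "speech_start":
                print("speech detected...")
            elif kind == "speech_end":
                print("speech ended, translating...")
            elif kind == "transcript":
                return {
                    "transcript": message.get("transcript"),
                    "translation": message.get("translation"),
                }
    except asyncio.TimeoutError:
        print("timed out waiting for the translated transcript")
        return None
```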
When using `vad_signals=True`, expect a slight delay between receiving the "speech_end" signal and the final transcript with translation. This delay allows the model to process the complete audio segment, generate transcription, and perform translation. The timeout in the example above can be adjusted based on your audio length and requirements.
Flush Signal
When streaming audio, you may want to flush the buffer before sending a new chunk.
The `flush_signal` lets you:
- Immediately process everything in the current buffer and get a transcript without waiting.
- Reduce latency so that interactive experiences (like live captions or assistants) feel more natural.
- Take control of when audio is finalized, instead of relying only on silence detection or timeouts.
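The exact flush message is not shown here, so the payload in the sketch below is only a placeholder assumption; check the cookbook or API reference for the real schema. Conceptually, you send the flush signal on the same WebSocket and then read the transcript for whatever audio was already buffered:

```python
import json

async def flush_and_read(ws) -> dict:
    """Force the server to finalize the current audio buffer.

    The {"event": "flush"} payload is an assumption standing in for the real
    flush_signal message -- replace it with the documented schema.
    """
    await ws.send(json.dumps({"event": "flush"}))
    # The next message should contain the transcript for the buffered audio.
    return json.loads(await ws.recv())
```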