Streaming Speech-to-Text API

Real-time Processing

Process audio streams in real time with WebSocket connections. Ideal for:

  • Live transcription
  • Real-time translation
  • Interactive applications
  • Low-latency requirements

Features

Real-time Processing
  • Live audio transcription
  • WebSocket-based streaming
  • Low-latency responses
Language Support
  • Support for multiple Indian languages and English
  • Language code specification (e.g., “kn-IN” for Kannada)
  • High accuracy transcription
Integration
  • Python and JavaScript SDK with async support
  • WebSocket connections
  • Easy-to-use API interface
  • The STT and STTT WebSocket endpoints support only .wav and raw PCM audio (see the conversion sketch below)
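
Since only .wav and raw PCM are accepted over these WebSockets, convert other formats before streaming. Below is a minimal conversion sketch that shells out to ffmpeg (assumed to be installed separately; the file names are placeholders):

import subprocess

# Convert any input file to 16 kHz, mono, 16-bit PCM WAV
# ("input.mp3" and "output.wav" are placeholder paths; requires ffmpeg on PATH)
subprocess.run(
    [
        "ffmpeg", "-i", "input.mp3",
        "-ar", "16000",            # target sample rate
        "-ac", "1",                # mono
        "-acodec", "pcm_s16le",    # 16-bit PCM samples
        "output.wav",
    ],
    check=True,
)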

Best Practices

  • Send a continuous stream of audio data
  • Use 1 second of trailing silence when VAD (Voice Activity Detection) sensitivity is set to false
  • Use 0.5 seconds of trailing silence when VAD sensitivity is set to true
  • Audio of arbitrary length can be sent (a chunked-streaming sketch follows this list)
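
The examples below all send a single pre-recorded buffer; to follow the practices above in a live setting, send fixed-size chunks followed by trailing silence. Here is a minimal sketch, assuming 16 kHz, 16-bit mono raw PCM input and a 0.5-second chunk size (both illustrative choices, not API requirements):

import asyncio
import base64
from sarvamai import AsyncSarvamAI

SAMPLE_RATE = 16000   # Hz; assumes 16-bit mono PCM
CHUNK_SECONDS = 0.5   # illustrative chunk size

async def stream_in_chunks(pcm_bytes: bytes):
    client = AsyncSarvamAI(api_subscription_key="your-api-key")
    chunk_size = int(SAMPLE_RATE * 2 * CHUNK_SECONDS)  # 2 bytes per sample

    async with client.speech_to_text_streaming.connect(language_code="kn-IN") as ws:
        # Send the audio as a continuous series of chunks
        for start in range(0, len(pcm_bytes), chunk_size):
            chunk = pcm_bytes[start:start + chunk_size]
            await ws.transcribe(audio=base64.b64encode(chunk).decode("utf-8"))

        # Append ~1 second of silence (zero samples) so VAD can finalize
        silence = b"\x00" * (SAMPLE_RATE * 2)
        await ws.transcribe(audio=base64.b64encode(silence).decode("utf-8"))

        resp = await ws.recv()
        print(resp)

if __name__ == "__main__":
    with open("path/to/your/audio.raw", "rb") as f:
        asyncio.run(stream_in_chunks(f.read()))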

You can find sample audio files and their corresponding base64-encoded strings in the GitHub cookbook.

Saarika: Our Speech-to-Text Transcription Model

Basic Streaming Transcription

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def transcribe_stream():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # Connect to streaming transcription
    async with client.speech_to_text_streaming.connect(language_code="kn-IN") as ws:
        # Send audio data
        await ws.transcribe(audio=audio_data)
        print("[Debug]: sent audio message")

        # Receive transcription response
        resp = await ws.recv()
        print(f"[Debug]: received response: {resp}")

if __name__ == "__main__":
    asyncio.run(transcribe_stream())

# --- Notebook/Colab usage ---
# await transcribe_stream()

Streaming Guide

  • language_code: Specifies the language for speech recognition (e.g., en-IN, hi-IN, kn-IN).

  • model: Selects the speech-to-text model version (e.g., saarika:v2.5).

  • high_vad_sensitivity: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.

  • vad_signals: When enabled, provides VAD event signals in the response stream (see the dispatch sketch after this list):

    • “speech_start”: Indicates the beginning of speech detection
    • “speech_end”: Indicates the end of speech detection
    • “transcript”: Contains the final transcription after speech end
  • sample_rate: Specifies the sample rate of the input audio in Hz (e.g., 8000, 16000, 44100). Allows for optimal processing based on your audio quality.
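
As a sketch of consuming these signals, the loop below dispatches on the message type. It assumes the messages behave like the dicts shown in the examples later on this page; adapt the field access to however your client deserializes responses:

async def handle_vad_messages(ws):
    # Dispatch on the VAD signal types listed above
    # (assumes dict-like messages, matching the printed examples below)
    async for message in ws:
        if message["type"] == "speech_start":
            print("Speech started")
        elif message["type"] == "speech_end":
            print("Speech ended; transcript is being finalized")
        elif message["type"] == "transcript":
            print("Final transcript:", message["text"])
            break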

New Streaming Features

Sample Rate and Input Audio Codec Support

STT streaming now supports specifying both the sample rate and input audio codec for better audio processing optimization.

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def transcribe_with_codec_and_sample_rate():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="kn-IN",
        input_audio_codec="pcm",
        sample_rate=16000,
    ) as ws:
        await ws.transcribe(audio=audio_data)
        resp = await ws.recv()
        print(resp)

if __name__ == "__main__":
    asyncio.run(transcribe_with_codec_and_sample_rate())

Example Usage

Basic Streaming

import base64
import asyncio
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def transcribe_stream():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="en-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True,
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000,
        )
        response = await ws.recv()
        print(response)

if __name__ == "__main__":
    asyncio.run(transcribe_stream())

Streaming with VAD Signals

When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:

import base64
import asyncio
import contextlib
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def transcribe_stream():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_streaming.connect(
        language_code="en-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True,
        vad_signals=True,
    ) as ws:
        await ws.transcribe(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000,
        )
        print("[Debug]: Sent audio")

        # Gracefully wait for streaming responses
        with contextlib.suppress(asyncio.TimeoutError):
            async with asyncio.timeout(10):  # Adjust as needed; requires Python 3.11+
                async for message in ws:
                    print(message)
                    # Example message sequence:
                    # {"type": "speech_start"}
                    # {"type": "speech_end"}
                    # {"type": "transcript", "text": "Your transcribed text here"}

if __name__ == "__main__":
    asyncio.run(transcribe_stream())

Note: When using vad_signals=True, expect a slight delay between receiving the “speech_end” signal and the final transcript. This delay allows the model to process the complete audio segment and generate accurate transcription. The timeout in the example above can be adjusted based on your audio length and requirements.

Saaras: Our Speech-to-Text Translation Model

Streaming Translation

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def translate_stream():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # Connect to streaming translation
    async with client.speech_to_text_translate_streaming.connect() as ws:
        # Send audio for translation
        await ws.translate(audio=audio_data)
        print("[Debug]: sent audio message")

        # Receive translation response
        resp = await ws.recv()
        print(f"[Debug]: received response: {resp}")

if __name__ == "__main__":
    asyncio.run(translate_stream())

# --- Notebook/Colab usage ---
# await translate_stream()

STT Translation Streaming Guide

  • model: Selects the speech-to-text model version (e.g., saaras:v2.5).

  • high_vad_sensitivity: Enables high sensitivity for Voice Activity Detection (VAD), helpful in noisy or soft speech environments.

  • vad_signals: When enabled, provides VAD event signals in the response stream:

    • “speech_start”: Indicates the beginning of speech detection
    • “speech_end”: Indicates the end of speech detection
    • “transcript”: Contains the transcription and translation after speech end
  • sample_rate: Specifies the sample rate of the input audio in Hz (e.g., 8000, 16000, 44100). Allows for optimal processing based on your audio quality.

Example Usage

Basic Streaming

import asyncio
import base64
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def translate_stream():
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_translate_streaming.connect(
        model="saaras:v2.5"
    ) as ws:
        await ws.translate(
            audio=audio_data,
            encoding="audio/wav",
            sample_rate=16000,
        )
        response = await ws.recv()
        print(response)

if __name__ == "__main__":
    asyncio.run(translate_stream())

Streaming with VAD Signals

When using VAD signals, the API returns multiple messages in sequence. You’ll need to handle these messages appropriately and wait for the complete sequence (speech_start → speech_end → transcript). Here’s an example:

import asyncio
import base64
import contextlib
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def translate_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_API_SUBSCRIPTION_KEY")

    async with client.speech_to_text_translate_streaming.connect(
        model="saaras:v2.5",
        vad_signals=True,
    ) as ws:
        # Send audio data
        await ws.translate(audio=audio_data, encoding="audio/wav", sample_rate=16000)
        print("[Debug]: Sent audio")

        # Gracefully wait for streaming responses
        with contextlib.suppress(asyncio.TimeoutError):
            async with asyncio.timeout(10):  # Adjust as needed; requires Python 3.11+
                async for message in ws:
                    print(message)
                    # Example message sequence:
                    # {"type": "speech_start"}
                    # {"type": "speech_end"}
                    # {"type": "translation", "text": "Translated text here"}

if __name__ == "__main__":
    asyncio.run(translate_stream())

When using vad_signals=True, expect a slight delay between receiving the “speech_end” signal and the final transcript with translation. This delay allows the model to process the complete audio segment, generate transcription, and perform translation. The timeout in the example above can be adjusted based on your audio length and requirements.

Flush Signal

When streaming audio, you may want to flush the buffer before sending a new chunk.
The flush_signal lets you:

  • Immediately process everything in the current buffer and get a transcript without waiting.
  • Reduce latency so that interactive experiences (like live captions or assistants) feel more natural.
  • Take control of when audio is finalized, instead of relying only on silence detection or timeouts.

Example Usage with flush_signal


import base64
import asyncio
from sarvamai import AsyncSarvamAI

# Load and encode audio file
with open("path/to/your/audio.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

async def test_flush_signal():
    """Simple test script for flush signal functionality"""

    # Initialize client
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    # Connect to streaming service
    async with client.speech_to_text_translate_streaming.connect() as ws:
        print("Connected to streaming service")

        # Send audio data
        await ws.translate(audio=audio_data, encoding="audio/wav", sample_rate=16000)
        print("Audio sent successfully")

        # Send flush signal to process audio buffer
        await ws.flush()
        print("Flush signal sent")

        # Collect responses
        print("\nReceiving responses:")
        async for message in ws:
            print(f"Response: {message}")

            # Stop after the first response to avoid waiting indefinitely
            break

if __name__ == "__main__":
    asyncio.run(test_flush_signal())
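
Building on the example above, the "flush before sending a new chunk" pattern looks like this: send one utterance, flush to force a result, then reuse the same connection for the next. The two-utterance structure, file paths, and variable names here are illustrative assumptions:

import asyncio
import base64
from sarvamai import AsyncSarvamAI

async def flush_between_utterances(first_chunk: str, second_chunk: str):
    # first_chunk and second_chunk are base64-encoded audio strings
    client = AsyncSarvamAI(api_subscription_key="your-api-key")

    async with client.speech_to_text_translate_streaming.connect() as ws:
        # First utterance: send, then flush to finalize it immediately
        await ws.translate(audio=first_chunk, encoding="audio/wav", sample_rate=16000)
        await ws.flush()
        print("First result:", await ws.recv())

        # Second utterance on the same connection
        await ws.translate(audio=second_chunk, encoding="audio/wav", sample_rate=16000)
        await ws.flush()
        print("Second result:", await ws.recv())

if __name__ == "__main__":
    with open("path/to/first.wav", "rb") as f:
        first = base64.b64encode(f.read()).decode("utf-8")
    with open("path/to/second.wav", "rb") as f:
        second = base64.b64encode(f.read()).decode("utf-8")
    asyncio.run(flush_between_utterances(first, second))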