Streaming Text-to-Speech API

Real-time Processing

Real-time conversion of text into spoken audio, where the audio is generated and played back progressively as the text is being processed.

Efficient for long texts
Real-time conversion
Handle multiple requests easily
Low latency audio generation and faster responses

Features

Low Latency Playback

Audio starts playing immediately as the text is processed
Speaks dynamic or live content as it comes in

Language Support

Multiple Indian languages and English support
Language code specification (e.g., “kn-IN” for Kannada)
High accuracy transcription

Efficient Resource Usage

Streams small chunks of audio instead of generating everything at once.
Uses less memory and keeps performance stable even with long texts.

Integration

Python and JavaScript SDK with async support
WebSocket connections
Easy-to-use API interface

Code Examples

Best Practices

Always send the config message first
Keep text chunks under 500 characters for optimal streaming
Use flush messages strategically to ensure complete text processing
Send ping messages to maintain long-running connections
Handle error responses appropriately in your application logic

Python

JavaScript

1 import asyncio
2 import base64
3 from sarvamai import AsyncSarvamAI, AudioOutput
4 import websockets
5 
6 async def tts_stream():
7     client = AsyncSarvamAI(api_subscription_key="YOUR_API_KEY")
8 
9     async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
10         await ws.configure(target_language_code="hi-IN", speaker="anushka")
11         print("Sent configuration")
12 
13         long_text = (
14             "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है।"
15             "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
16             "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।"
17         )
18 
19         await ws.convert(long_text)
20         print("Sent text message")
21 
22         await ws.flush()
23         print("Flushed buffer")
24 
25         chunk_count = 0
26         with open("output.mp3", "wb") as f:
27             async for message in ws:
28                 if isinstance(message, AudioOutput):
29                     chunk_count += 1
30                     audio_chunk = base64.b64decode(message.data.audio)
31                     f.write(audio_chunk)
32                     f.flush()
33 
34         print(f"All {chunk_count} chunks saved to output.mp3")
35         print("Audio generation complete")
36 
37         
38         if hasattr(ws, "_websocket") and not ws._websocket.closed:
39             await ws._websocket.close()
40             print("WebSocket connection closed.")
41 
42 
43 if __name__ == "__main__":
44     asyncio.run(tts_stream())
45 
46 # --- Notebook/Colab usage ---
47 # await tts_stream()

Streaming TTS WebSocket – Integration Guide

Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.

Input Message Types

Config Message

Text Message

Flush Message

Ping Message

Sets up voice parameters and must be the first message sent after connection. Parameters:

min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
output_audio_codec: Currently supports mp3 and wav formats (optimized for real-time playback)
output_audio_bitrate: Choose from 5 supported bitrate options

1 {
2   "type": "config",
3   "data": {
4     "speaker": "anushka",
5     "target_language_code": "en-IN",
6     "pitch": 0.8,
7     "pace": 2,
8     "min_buffer_size": 50,
9     "max_chunk_length": 200,
10     "output_audio_codec": "mp3",
11     "output_audio_bitrate": "128k"
12   }
13 }