Streaming Text-to-Speech API

Real-time Processing

Converts text into spoken audio in real time, generating and playing back audio progressively while the rest of the text is still being processed.

  • Efficient for long texts
  • Real-time conversion
  • Handles multiple concurrent requests
  • Low-latency audio generation for faster responses
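Because audio chunks arrive while later text is still being synthesized, they can be piped straight into a player instead of being saved to a file first. Below is a minimal sketch of progressive playback, assuming ffplay (part of FFmpeg) is installed and reusing the Python SDK calls from the code example further down this page:

import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput


async def play_as_it_streams(text: str):
    # ffplay reads MP3 from stdin, so playback starts as soon as the first chunk arrives
    player = await asyncio.create_subprocess_exec(
        "ffplay", "-nodisp", "-autoexit", "-loglevel", "quiet", "pipe:0",
        stdin=asyncio.subprocess.PIPE,
    )
    client = AsyncSarvamAI(api_subscription_key="YOUR_API_KEY")
    async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
        await ws.configure(target_language_code="hi-IN", speaker="anushka")
        await ws.convert(text)
        await ws.flush()
        async for message in ws:
            if isinstance(message, AudioOutput):
                # Feed each decoded chunk to the player as it arrives
                player.stdin.write(base64.b64decode(message.data.audio))
                await player.stdin.drain()
    player.stdin.close()
    await player.wait()

# Usage: asyncio.run(play_as_it_streams("नमस्ते, यह स्ट्रीमिंग प्लेबैक का एक छोटा उदाहरण है।"))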

Features

Low Latency Playback
  • Audio starts playing immediately as the text is processed
  • Speaks dynamic or live content as it comes in
Language Support
  • Multiple Indian languages and English support
  • Language code specification (e.g., “kn-IN” for Kannada)
  • High-quality, natural-sounding speech output
Efficient Resource Usage
  • Streams small chunks of audio instead of generating everything at once.
  • Uses less memory and keeps performance stable even with long texts.
Integration
  • Python and JavaScript SDKs with async support
  • WebSocket connections
  • Easy-to-use API interface
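For example, switching the stream to Kannada only requires the "kn-IN" language code when configuring the connection. A minimal sketch using the Python SDK's configure call from the example further down this page (the speaker name is an assumption carried over from the Hindi example and should be checked against the voices available for Kannada):

from sarvamai import AsyncSarvamAI


async def configure_kannada_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_API_KEY")
    async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
        # "kn-IN" selects Kannada; the speaker may need adjusting per language
        await ws.configure(target_language_code="kn-IN", speaker="anushka")
        # ...then convert/flush and read AudioOutput messages as in the full example below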

Code Examples

Best Practices

  • Always send the config message first
  • Keep text chunks under 500 characters for optimal streaming (see the chunking sketch after the example below)
  • Use flush messages strategically to ensure complete text processing
  • Send ping messages to maintain long-running connections
  • Handle error responses appropriately in your application logic
The following end-to-end example streams Hindi text to an MP3 file using the Python SDK:

import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput


async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
        # The config message must be the first message on the connection
        await ws.configure(target_language_code="hi-IN", speaker="anushka")
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है। "
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        # Flush so any buffered text is synthesized
        await ws.flush()
        print("Flushed buffer")

        # Decode each base64 audio chunk as it arrives and append it to the file
        chunk_count = 0
        with open("output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)
                    f.flush()

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")

        # Defensive close; the async context manager normally closes the connection
        if hasattr(ws, "_websocket") and not ws._websocket.closed:
            await ws._websocket.close()
            print("WebSocket connection closed.")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
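The best practices above suggest keeping each text chunk under 500 characters. One way to do that is to split long input before sending it; the sketch below reuses only the configure/convert/flush calls from the example above (the helper name and word-level splitting strategy are illustrative):

import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput


def split_text(text: str, limit: int = 500) -> list[str]:
    # Greedy word-level split so each chunk stays under the character limit
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > limit and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


async def stream_long_text(text: str):
    client = AsyncSarvamAI(api_subscription_key="YOUR_API_KEY")
    async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
        await ws.configure(target_language_code="hi-IN", speaker="anushka")
        for chunk in split_text(text):
            await ws.convert(chunk)  # one message per sub-500-character chunk
        await ws.flush()  # make sure any buffered text is synthesized
        with open("long_output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    f.write(base64.b64decode(message.data.audio))

# Usage: asyncio.run(stream_long_text(very_long_text))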

Streaming TTS WebSocket – Integration Guide

Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.

Input Message Types

Config Message

Sets up voice parameters; this must be the first message sent after the connection is established. Parameters:

  • min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
  • max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
  • output_audio_codec: Currently supports MP3 only (optimized for real-time playback)
  • output_audio_bitrate: Choose from 5 supported bitrate options
{
  "type": "config",
  "data": {
    "target_language_code": "en-IN",
    "pitch": 1.2,
    "pace": 0.8,
    "min_buffer_size": 50,
    "max_chunk_length": 200,
    "output_audio_codec": "mp3",
    "output_audio_bitrate": 128
  }
}
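For raw WebSocket integrations, it can help to assemble this frame in code before sending it as the first message on the connection. A minimal sketch (the helper name and default values are illustrative; the fields mirror the schema above):

import json


def build_config_message(
    target_language_code: str,
    pitch: float = 1.0,
    pace: float = 1.0,
    min_buffer_size: int = 50,
    max_chunk_length: int = 200,
    output_audio_codec: str = "mp3",
    output_audio_bitrate: int = 128,
) -> str:
    # Serialize a config frame matching the documented schema
    return json.dumps({
        "type": "config",
        "data": {
            "target_language_code": target_language_code,
            "pitch": pitch,
            "pace": pace,
            "min_buffer_size": min_buffer_size,
            "max_chunk_length": max_chunk_length,
            "output_audio_codec": output_audio_codec,
            "output_audio_bitrate": output_audio_bitrate,
        },
    })


# First frame to send after the WebSocket connection opens
print(build_config_message("en-IN", pitch=1.2, pace=0.8))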