Streaming Text-to-Speech API

Real-time Processing

Real-time conversion of text into spoken audio, where the audio is generated and played back progressively as the text is being processed.

  • Efficient for long texts
  • Real-time conversion
  • Handles multiple concurrent requests
  • Low-latency audio generation for faster responses

Features

Low Latency Playback
  • Audio starts playing immediately as the text is processed
  • Speaks dynamic or live content as it arrives
Language Support
  • Multiple Indian languages and English support
  • Language code specification (e.g., “kn-IN” for Kannada)
  • High-accuracy speech output
Efficient Resource Usage
  • Streams small chunks of audio instead of generating everything at once
  • Uses less memory and keeps performance stable even with long texts
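
The chunked approach can be illustrated with a small sketch (this is an illustration of the idea, not the SDK's internals): a generator yields fixed-size slices, so a consumer never needs to hold the full payload in memory.

```python
def iter_chunks(data: bytes, chunk_size: int = 4096):
    """Yield fixed-size slices so a consumer never holds the full payload."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# A 10 KB payload becomes two full 4 KB chunks plus a smaller tail.
chunks = list(iter_chunks(b"\x00" * 10_000))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

The same pattern is why long texts do not blow up memory: each audio chunk is written out (or played) and then discarded.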

Integration
  • Python and JavaScript SDK with async support
  • WebSocket connections
  • Easy-to-use API interface

Code Examples

Best Practices

  • Always send the config message first
  • Use flush messages strategically to ensure complete text processing
  • Send ping messages to maintain long-running connections
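
The ping recommendation can be sketched as a generic keepalive loop. The `ping` callable below is a stand-in for whatever ping mechanism your client exposes (the exact SDK method name is not shown in this guide); the loop simply fires it at a fixed interval until the stream is done.

```python
import asyncio


async def keepalive(ping, stop_event: asyncio.Event, interval: float = 30.0):
    """Call `ping()` every `interval` seconds until `stop_event` is set."""
    while not stop_event.is_set():
        try:
            # Wake early if the stream finishes; otherwise ping on timeout.
            await asyncio.wait_for(stop_event.wait(), timeout=interval)
        except asyncio.TimeoutError:
            await ping()


async def demo():
    pings = 0

    async def fake_ping():
        nonlocal pings
        pings += 1

    stop = asyncio.Event()
    task = asyncio.create_task(keepalive(fake_ping, stop, interval=0.01))
    await asyncio.sleep(0.05)  # let a few pings fire
    stop.set()
    await task
    return pings


print(asyncio.run(demo()))
```

Run the keepalive as a background task alongside your receive loop and set the event once the final audio chunk arrives.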
Python

```python
import asyncio
import base64

from sarvamai import AsyncSarvamAI, AudioOutput


async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v2") as ws:
        await ws.configure(target_language_code="hi-IN", speaker="anushka")
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है। "
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        chunk_count = 0
        with open("output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)
                    f.flush()

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")

    # The context manager closes the connection; this guard handles any
    # lingering socket explicitly.
    if hasattr(ws, "_websocket") and not ws._websocket.closed:
        await ws._websocket.close()
        print("WebSocket connection closed.")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
```

End of Speech Signal

The TTS streaming API supports an end-of-speech signal that allows clean stream termination when speech generation is complete.

Using send_completion_event

When you set send_completion_event=True in the connection, the API will send a completion event when speech generation ends, allowing your application to handle stream termination gracefully.

Python
```python
import asyncio
import base64

from sarvamai import AsyncSarvamAI, AudioOutput, EventResponse


async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(
        model="bulbul:v2", send_completion_event=True
    ) as ws:
        await ws.configure(
            target_language_code="hi-IN",
            speaker="anushka",
        )
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध "
            "संस्कृतियों में से एक है। "
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, "
            "वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        chunk_count = 0
        with open("output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)
                    f.flush()
                elif isinstance(message, EventResponse):
                    print(f"Received completion event: {message.data.event_type}")
                    # Break when we receive the final event
                    if message.data.event_type == "final":
                        break

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
```

Streaming TTS WebSocket – Integration Guide

Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.

Input Message Types

Config message: sets up voice parameters; it must be the first message sent after the connection opens. Parameters:

  • min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
  • max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
  • output_audio_codec: Supports multiple formats: mp3, wav, aac, opus, flac, pcm (LINEAR16), mulaw (μ-law), and alaw (A-law)
  • output_audio_bitrate: Choose from 5 supported bitrate options
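
To illustrate what `min_buffer_size` controls, here is a client-side sketch of the buffering idea (not the server's actual implementation): text accumulates until the threshold is reached, then a chunk is released for synthesis, and a flush forces out whatever remains.

```python
class TextBuffer:
    """Accumulate text; release a chunk once `min_buffer_size` chars are buffered."""

    def __init__(self, min_buffer_size: int = 50):
        self.min_buffer_size = min_buffer_size
        self._parts: list[str] = []
        self._length = 0

    def feed(self, text: str):
        """Add text; return a chunk to synthesize when the threshold is met, else None."""
        self._parts.append(text)
        self._length += len(text)
        if self._length >= self.min_buffer_size:
            chunk = "".join(self._parts)
            self._parts, self._length = [], 0
            return chunk
        return None  # keep buffering

    def flush(self):
        """Force out whatever remains (the role of the flush message)."""
        chunk = "".join(self._parts)
        self._parts, self._length = [], 0
        return chunk


buf = TextBuffer(min_buffer_size=10)
print(buf.feed("Hello "))   # None - below threshold
print(buf.feed("world!"))   # "Hello world!" - threshold reached
```

A larger `min_buffer_size` gives the model more context per chunk; a smaller one lowers time-to-first-audio.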
```json
{
  "type": "config",
  "data": {
    "speaker": "anushka",
    "target_language_code": "en-IN",
    "pitch": 0.8,
    "pace": 2,
    "min_buffer_size": 50,
    "max_chunk_length": 200,
    "output_audio_codec": "mp3",
    "output_audio_bitrate": "128k"
  }
}
```
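
The same config message can be built and serialized from Python before being sent as a WebSocket text frame; the values below mirror the JSON above.

```python
import json

config_message = {
    "type": "config",
    "data": {
        "speaker": "anushka",
        "target_language_code": "en-IN",
        "pitch": 0.8,
        "pace": 2,
        "min_buffer_size": 50,
        "max_chunk_length": 200,
        "output_audio_codec": "mp3",
        "output_audio_bitrate": "128k",
    },
}

# Serialize for transmission as a WebSocket text frame.
payload = json.dumps(config_message)
print(json.loads(payload)["data"]["speaker"])  # anushka
```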