Streaming Text-to-Speech API

WebSocket-based streaming endpoint that sends audio chunks progressively as text is processed. Connect once, stream text in, receive audio out — no polling, no repeated HTTP calls.

For complete API reference documentation, see the Text-to-Speech API Reference section.

Common use cases:

  • Conversational AI agents — Stream TTS responses in real time for voice-based assistants
  • Interactive podcasts — Generate and play back dialogue on the fly with minimal delay
  • Low-latency applications — Any scenario where time-to-first-byte (TTFB) matters (IVR, live narration, kiosks)

Why WebSocket Streaming

Low Latency Playback

Audio playback begins as soon as the first chunk is synthesized — no need to wait for the full response.

11 Languages (10 Indian + English)

Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, and English (Indian accent).

Persistent Connection

A single WebSocket connection handles multiple text-to-speech conversions. Send the config once, then stream text continuously.

SDK Support

Python (AsyncSarvamAI) and JavaScript (SarvamAIClient) SDKs with built-in async/await and event-driven patterns for seamless integration.

Code Examples

Best Practices

  • Always send the config message first
  • Use flush messages strategically to ensure complete text processing
  • Send ping messages to maintain long-running connections
import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput

async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v3") as ws:
        await ws.configure(target_language_code="hi-IN", speaker="shubh")
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है।"
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        # Audio streams back as chunks. The server keeps the socket open for
        # any further text, so we stop once no new audio arrives within a short
        # window (the `async with` block closes the connection on exit).
        chunk_count = 0
        with open("output.mp3", "wb") as f:
            while True:
                try:
                    message = await asyncio.wait_for(ws.recv(), timeout=3.0)
                except asyncio.TimeoutError:
                    break
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    f.write(base64.b64decode(message.data.audio))
                    f.flush()

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()

End of Speech Signal

The TTS streaming API now supports an end of speech signal that allows for clean stream termination when speech generation is complete.

Using send_completion_event

When you set send_completion_event=True while opening the connection, the API sends a completion event once speech generation ends, so your application can stop reading from the stream as soon as the audio is complete instead of waiting for a timeout.

Python
import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput, EventResponse


async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(
        model="bulbul:v3", send_completion_event=True
    ) as ws:
        await ws.configure(
            target_language_code="hi-IN",
            speaker="shubh",
        )
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध "
            "संस्कृतियों में से एक है।"
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, "
            "वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        chunk_count = 0
        with open("output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)
                    f.flush()
                elif isinstance(message, EventResponse):
                    print(f"Received completion event: {message.data.event_type}")
                    # Break when we receive the final event
                    if message.data.event_type == "final":
                        break

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
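If the final event never arrives (for example, after a silent network stall), a loop that waits only for it would block indefinitely. One option is to combine the completion event with the inactivity timeout from the first example. The following is a minimal sketch under stated assumptions: drain_until_final is a hypothetical helper, not part of the SDK, and recv, is_final, and on_audio are callables you supply (for instance ws.recv and a check on message.data.event_type).

```python
import asyncio


async def drain_until_final(recv, is_final, on_audio, timeout=3.0):
    """Read messages until is_final(message) is true, or until no message
    arrives within `timeout` seconds. Returns True if the final event was seen.
    Hypothetical helper for illustration only."""
    while True:
        try:
            message = await asyncio.wait_for(recv(), timeout=timeout)
        except asyncio.TimeoutError:
            return False  # stream went quiet without a completion event
        if is_final(message):
            return True   # clean end of speech
        on_audio(message)  # every non-final message goes to the caller's handler
```

A False return value tells the caller the stream ended by timeout, which may be worth logging separately from a clean completion.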

Streaming TTS WebSocket – Integration Guide

Easily convert text to speech in real time using Sarvam’s low-latency WebSocket-based TTS API.

Input Message Types

The socket accepts four input message types: Config, Text, Flush, and Ping.

Config Message

Sets up voice parameters and must be the first message sent after connection. Parameters:

  • min_buffer_size: Minimum character length that triggers buffer flushing for TTS model processing
  • max_chunk_length: Maximum length for sentence splitting (adjust based on content length)
  • output_audio_codec: Supports multiple formats: mp3, wav, aac, opus, flac, pcm (LINEAR16), mulaw (μ-law), and alaw (A-law)
  • output_audio_bitrate: Choose from 5 supported bitrate options
{
  "type": "config",
  "data": {
    "speaker": "shubh",
    "target_language_code": "en-IN",
    "pace": 1.2,
    "min_buffer_size": 50,
    "max_chunk_length": 200,
    "output_audio_codec": "mp3",
    "output_audio_bitrate": "128k"
  }
}
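When talking to the socket directly rather than through the SDK, the config message can be assembled and sanity-checked in code before it is sent. The sketch below is illustrative only: build_config_message is a hypothetical helper, and its codec check simply mirrors the codec list given above.

```python
import json

# Codec names copied from the parameter list above.
SUPPORTED_CODECS = {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}


def build_config_message(speaker, target_language_code, output_audio_codec="mp3", **extra):
    """Return the config message as a JSON string (hypothetical helper)."""
    if output_audio_codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {output_audio_codec!r}")
    data = {
        "speaker": speaker,
        "target_language_code": target_language_code,
        "output_audio_codec": output_audio_codec,
        **extra,  # e.g. pace, min_buffer_size, max_chunk_length
    }
    return json.dumps({"type": "config", "data": data})


message = build_config_message(
    "shubh", "en-IN", pace=1.2, min_buffer_size=50, max_chunk_length=200
)
```

Rejecting an unknown codec on the client surfaces the mistake before the connection is opened.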

Handling Disconnects

| Close code | Meaning | What to do |
|---|---|---|
| 1000 | Normal closure | You called close(); nothing to do |
| 1001 | Going away | Server/client shutting down; reconnect |
| 1006 | Abnormal closure (no close frame) | Network drop; reconnect with backoff |
| 1011 | Server error | Retry with backoff; if persistent, check status |
| 4xxx | Application-specific | Read the close reason (e.g. auth/quota); fix before reconnecting |
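The table maps naturally onto a small lookup in client code. This is a sketch only: classify_close and its action labels are illustrative names, not part of any SDK.

```python
def classify_close(code):
    """Map a WebSocket close code to the action suggested in the table above."""
    if code == 1000:
        return "done"               # normal closure, nothing to do
    if code == 1001:
        return "reconnect"          # peer is shutting down
    if code in (1006, 1011):
        return "reconnect_backoff"  # network drop or server error
    # 4000-4999 and anything unlisted: read the close reason, do not blind-retry
    return "inspect_reason"
```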

Codes 1000–1015 are standard WebSocket codes; 4000–4999 are application-specific — always read the accompanying close reason rather than assuming a fixed meaning. An idle connection closes automatically after ~1 minute, so send ping() to keep long-lived sessions open.
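A keepalive can run as a background task next to the send and receive loops. In this minimal sketch, send_ping stands in for whatever ping call your client exposes (such as the ping() mentioned above), and the 20-second default interval is an assumption chosen to stay well inside the roughly one-minute idle window.

```python
import asyncio


async def keepalive(send_ping, interval=20.0):
    """Call send_ping() every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send_ping()


async def demo():
    pings = []

    async def fake_ping():  # stand-in for the real ping call
        pings.append(1)

    task = asyncio.create_task(keepalive(fake_ping, interval=0.01))
    await asyncio.sleep(0.05)  # ... stream text and read audio here ...
    task.cancel()              # stop pinging before closing the socket
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(pings)
```

Cancelling the task before the socket closes avoids a ping being sent on a connection that no longer exists.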

Reconnect with exponential backoff (pseudocode):

attempt = 0
while not connected and attempt < MAX_ATTEMPTS:
    try:
        open WebSocket, send config, resume sending text
        attempt = 0  # reset on success
    except (close 1006 / 1011 / network error):
        delay = min(BASE * 2 ** attempt, MAX_DELAY)  # e.g. 0.5s, 1s, 2s, 4s ... capped
        sleep(delay + small random jitter)
        attempt += 1
    on close 4xxx (auth/quota): stop and surface the error  # do not blind-retry

Don’t auto-retry on 4xxx auth/quota closes — fix the cause first (see Errors & Troubleshooting).
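The reconnect pseudocode above could be written in Python roughly as follows. Everything here is a sketch under assumptions: connect is a hypothetical coroutine that opens the socket and sends the config, the constants are illustrative, and ConnectionError / PermissionError are placeholders for whatever exceptions your client raises on retryable closes and on 4xxx auth/quota closes respectively.

```python
import asyncio
import random

BASE, MAX_DELAY, MAX_ATTEMPTS = 0.5, 8.0, 5  # illustrative values


def backoff_delay(attempt):
    """Exponential delay capped at MAX_DELAY: 0.5, 1, 2, 4, 8, 8, ..."""
    return min(BASE * 2 ** attempt, MAX_DELAY)


async def connect_with_backoff(connect, sleep=asyncio.sleep):
    """Retry `connect` on retryable failures; re-raise non-retryable ones."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return await connect()  # open socket and send config
        except PermissionError:
            raise                   # stands for a 4xxx auth/quota close: do not retry
        except ConnectionError:
            # stands for 1006 / 1011 / network error: wait, then try again
            await sleep(backoff_delay(attempt) + random.uniform(0, 0.1))
    raise ConnectionError(f"gave up after {MAX_ATTEMPTS} attempts")
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real delays.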

Voice-Agent Barge-In

When a user interrupts the agent mid-reply, playback should stop instantly. The TTS WebSocket has no server-side cancel/clear message: apart from the initial config, the only client operations are convert, flush, ping, and close. Handle barge-in entirely on the client:

  1. Stop playback locally the moment your STT detects speech_start — flush/clear your local audio buffer and stop the player.
  2. Close the TTS socket for the interrupted utterance so the server stops generating (any in-flight chunks are simply discarded on your side).
  3. Open a fresh connection for the agent’s next reply.
# In your STT speech_start handler (see the STT WebSocket barge-in recipe):
tts_player.stop()     # 1. stop and clear local audio playback immediately
await tts_ws.close()  # 2. close the TTS socket to stop further generation
# 3. open a new client.text_to_speech_streaming.connect(...) for the next turn

Because there’s no in-band cancel, keep TTS replies chunked into shorter convert() calls so a barge-in discards less already-generated audio. See the STT WebSocket barge-in recipe, the LiveKit and Pipecat voice-agent guides, and Credits & Rate Limits for streaming concurrency limits.
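One way to keep convert() calls short is to split each reply on sentence boundaries before sending it. The sketch below makes several assumptions: split_for_tts is a hypothetical helper, the 120-character default is arbitrary, and the punctuation set (which includes the Devanagari danda) should be adjusted per language.

```python
import re


def split_for_tts(text, max_len=120):
    """Group whole sentences into chunks of at most max_len characters.
    A single sentence longer than max_len is kept as its own chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?।])\s*", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to its own convert() call, so closing the socket on barge-in discards at most the chunks already sent.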