
# Streaming Text-to-Speech API

> Real-time conversion of text into speech using WebSocket connections. Efficient streaming for long texts with progressive audio generation and low-latency playback.

WebSocket-based streaming endpoint that sends audio chunks progressively as text is processed. Connect once, stream text in, receive audio out — no polling, no repeated HTTP calls.

For complete API reference documentation, see the [Text-to-Speech API Reference](https://docs.sarvam.ai/api-reference-docs/text-to-speech/convert) section.

**Common use cases:**

* **Conversational AI agents** — Stream TTS responses in real time for voice-based assistants
* **Interactive podcasts** — Generate and play back dialogue on the fly with minimal delay
* **Low-latency applications** — Any scenario where time-to-first-byte (TTFB) matters (IVR, live narration, kiosks)

## Why WebSocket Streaming

Audio playback begins as soon as the first chunk is synthesized — no need to wait for the full response.

Supported languages: Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Marathi, Gujarati, Punjabi, Odia, and English (Indian accent).

A single WebSocket connection handles multiple text-to-speech conversions. Send the config once, then stream text continuously.

The Python (`AsyncSarvamAI`) and JavaScript (`SarvamAIClient`) SDKs provide built-in async/await and event-driven patterns for integration.

## Code Examples

### Best Practices

* Always send the config message first
* Use flush messages strategically to ensure complete text processing
* Send ping messages to maintain long-running connections

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput

async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(model="bulbul:v3") as ws:
        await ws.configure(target_language_code="hi-IN", speaker="shubh")
        print("Sent configuration")

        # Adjacent string literals are concatenated, so each sentence ends
        # with a space to keep the sentences separated.
        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है। "
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        # Audio streams back as chunks. The server keeps the socket open for
        # any further text, so we stop once no new audio arrives within a short
        # window (the `async with` block closes the connection on exit).
        chunk_count = 0
        with open("output.mp3", "wb") as f:
            while True:
                try:
                    message = await asyncio.wait_for(ws.recv(), timeout=3.0)
                except asyncio.TimeoutError:
                    break
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    f.write(base64.b64decode(message.data.audio))
                    f.flush()

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

async function main() {
  const client = new SarvamAIClient({
    apiSubscriptionKey: "YOUR_SARVAM_API_KEY",
  });

  const socket = await client.textToSpeechStreaming.connect({
    model: "bulbul:v3",
  });

  let chunkCount = 0;
  const outputStream = fs.createWriteStream("output.mp3");

  let closeTimeout = null;

  socket.on("open", () => {
    console.log("Connection opened");

    // The config message must be sent before any text.
    socket.configureConnection({
      type: "config",
      data: {
        speaker: "shubh",
        target_language_code: "hi-IN",
      },
    });

    console.log("Configuration sent");

    const longText =
      "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है। " +
      "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, वास्तुकला और जीवनशैली शामिल हैं।";

    socket.convert(longText);
    console.log("Text sent for conversion");

    // Close the socket after a fixed 10-second window.
    closeTimeout = setTimeout(() => {
      console.log("Forcing socket close after timeout");
      socket.close();
    }, 10000);
  });

  socket.on("message", (message) => {
    if (message.type === "audio") {
      // Audio arrives base64-encoded; decode and append it to the file.
      chunkCount++;
      const audioBuffer = Buffer.from(message.data.audio, "base64");
      outputStream.write(audioBuffer);
      console.log(`Received and wrote chunk ${chunkCount}`);
    } else {
      console.log("Received message:", message);
    }
  });

  socket.on("close", (event) => {
    console.log("Connection closed:", event);
    if (closeTimeout) clearTimeout(closeTimeout);
    outputStream.end(() => {
      console.log(`All ${chunkCount} chunks saved to output.mp3`);
    });
  });

  socket.on("error", (error) => {
    console.error("Error occurred:", error);
    if (closeTimeout) clearTimeout(closeTimeout);
    outputStream.end();
  });

  await socket.waitForOpen();
  console.log("WebSocket is ready");
}

main().catch(console.error);
```

## End of Speech Signal

The TTS streaming API supports an end-of-speech signal that lets you terminate the stream cleanly once speech generation is complete.

### Using `send_completion_event`

When you pass `send_completion_event=True` while opening the connection, the API sends a completion event when speech generation ends, so your application can stop reading as soon as the final event arrives rather than waiting for a timeout.

```python
import asyncio
import base64
from sarvamai import AsyncSarvamAI, AudioOutput, EventResponse


async def tts_stream():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    async with client.text_to_speech_streaming.connect(
        model="bulbul:v3", send_completion_event=True
    ) as ws:
        await ws.configure(
            target_language_code="hi-IN",
            speaker="shubh",
        )
        print("Sent configuration")

        long_text = (
            "भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध "
            "संस्कृतियों में से एक है। "
            "यह विविधता, सहिष्णुता और परंपराओं का अद्भुत संगम है, "
            "जिसमें विभिन्न धर्म, भाषाएं, त्योहार, संगीत, नृत्य, "
            "वास्तुकला और जीवनशैली शामिल हैं।"
        )

        await ws.convert(long_text)
        print("Sent text message")

        await ws.flush()
        print("Flushed buffer")

        chunk_count = 0
        with open("output.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, AudioOutput):
                    chunk_count += 1
                    audio_chunk = base64.b64decode(message.data.audio)
                    f.write(audio_chunk)
                    f.flush()
                elif isinstance(message, EventResponse):
                    print(f"Received completion event: {message.data.event_type}")
                    # Break when we receive the final event
                    if message.data.event_type == "final":
                        break

        print(f"All {chunk_count} chunks saved to output.mp3")
        print("Audio generation complete")


if __name__ == "__main__":
    asyncio.run(tts_stream())

# --- Notebook/Colab usage ---
# await tts_stream()
```

## Streaming TTS WebSocket – Integration Guide

Easily convert text to speech in real time using Sarvam's low-latency WebSocket-based TTS API.

### Input Message Types

The `config` message sets up voice parameters and must be the first message sent after connecting.

**Parameters:**

* `min_buffer_size`: Minimum character length that triggers buffer flushing for TTS model processing
* `max_chunk_length`: Maximum length for sentence splitting (adjust based on content length)
* `output_audio_codec`: Output format, one of `mp3`, `wav`, `aac`, `opus`, `flac`, `pcm` (LINEAR16), `mulaw` (μ-law), or `alaw` (A-law)
* `output_audio_bitrate`: One of five supported bitrate options

```json
{
  "type": "config",
  "data": {
    "speaker": "shubh",
    "target_language_code": "en-IN",
    "pace": 1.2,
    "min_buffer_size": 50,
    "max_chunk_length": 200,
    "output_audio_codec": "mp3",
    "output_audio_bitrate": "128k"
  }
}
```
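If you build the config message yourself rather than through an SDK, it can help to validate values before sending. The following is a minimal sketch, assuming a hypothetical helper named `build_config_message` (it is not part of the SDK). The codec set is copied from the parameter list above; bitrates are not validated because the allowed values are not listed here.

```python
import json

# Codecs listed for `output_audio_codec` in this guide.
SUPPORTED_CODECS = {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}


def build_config_message(speaker, target_language_code, **options):
    """Return the JSON text of a `config` message (hypothetical helper)."""
    codec = options.get("output_audio_codec")
    if codec is not None and codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported output_audio_codec: {codec!r}")
    data = {"speaker": speaker, "target_language_code": target_language_code}
    data.update(options)  # pace, min_buffer_size, max_chunk_length, ...
    return json.dumps({"type": "config", "data": data})


message = build_config_message(
    "shubh", "en-IN", pace=1.2, min_buffer_size=50, output_audio_codec="mp3"
)
print(message)
```

Rejecting an unknown codec locally surfaces the mistake before the connection is opened.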

The `text` message sends text to be converted to speech.

* **Range**: 0-2500 characters
* **Recommended**: \<500 characters for optimal streaming performance
  Real-time endpoints perform better with longer character counts

```json
{
  "type": "text",
  "data": {
    "text": "This is an example sentence that will be converted to speech."
  }
}
```
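Text longer than the per-message limit has to be split before sending. One possible approach, sketched below, breaks the input at sentence boundaries and falls back to a hard cut for a single over-long sentence. The `split_text` helper and its constants are illustrative names, not part of the API; the 500 and 2500 figures come from the limits stated above.

```python
import re

MAX_CHARS = 2500         # upper bound per text message, per this guide
RECOMMENDED_CHARS = 500  # recommended size for streaming


def split_text(text, limit=RECOMMENDED_CHARS):
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    if not 0 < limit <= MAX_CHARS:
        raise ValueError("limit must be between 1 and 2500")
    # Break after '.', '!', '?' or the Devanagari danda when followed by whitespace.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for piece in split_text("First sentence. Second sentence! Third one?", limit=20):
    print(repr(piece))
```

Each piece can then be passed to a separate `convert()` call, as in the SDK examples earlier on this page.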

The `flush` message forces the buffered text to be processed immediately, regardless of the `min_buffer_size` threshold.
Use it to ensure all buffered text is processed.

```json
{
  "type": "flush"
}
```

The `ping` message keeps the WebSocket connection alive; send it periodically to avoid a timeout.
The connection closes automatically after one minute of inactivity.

```json
{
  "type": "ping"
}
```
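A background task is one way to send pings on a schedule. The sketch below is an assumption-laden illustration: `keep_alive` is a hypothetical helper, and `send_ping` stands for whatever awaitable sends the ping message in your client. The 30-second default is chosen to stay under the one-minute idle limit described above.

```python
import asyncio


async def keep_alive(send_ping, interval=30.0):
    """Call `send_ping()` every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send_ping()


async def demo():
    pings = []

    async def fake_ping():  # stand-in for the real ping call
        pings.append("ping")

    task = asyncio.create_task(keep_alive(fake_ping, interval=0.01))
    await asyncio.sleep(0.05)  # ... stream text and receive audio here ...
    task.cancel()              # stop pinging before closing the connection
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(pings)


print(asyncio.run(demo()))
```

With the Python SDK this might look like `asyncio.create_task(keep_alive(ws.ping))`, assuming `ws.ping` is an awaitable method that takes no arguments.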

## Handling Disconnects

| Close code | Meaning                           | What to do                                                                   |
| ---------- | --------------------------------- | ---------------------------------------------------------------------------- |
| `1000`     | Normal closure                    | You called `close()` — nothing to do                                         |
| `1001`     | Going away                        | Server/client shutting down — reconnect                                      |
| `1006`     | Abnormal closure (no close frame) | Network drop — reconnect with backoff                                        |
| `1011`     | Server error                      | Retry with backoff; if persistent, check [status](https://status.sarvam.ai/) |
| `4xxx`     | Application-specific              | Read the close reason (e.g. auth/quota); fix before reconnecting             |

Codes `1000`–`1015` are standard WebSocket codes; `4000`–`4999` are application-specific — always read the accompanying close reason rather than assuming a fixed meaning. An idle connection closes automatically after \~1 minute, so send `ping()` to keep long-lived sessions open.
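The table can be encoded as a small lookup so that the reconnect logic lives in one place. This is a sketch only: `close_action` and its return values are hypothetical names, and the mapping simply mirrors the table above.

```python
def close_action(code):
    """Map a WebSocket close code to 'done', 'reconnect', or 'stop'."""
    if code == 1000:
        return "done"       # normal closure, nothing to do
    if code in (1001, 1006, 1011):
        return "reconnect"  # transient: reconnect, using backoff
    if 4000 <= code <= 4999:
        return "stop"       # application-specific: read the close reason first
    return "stop"           # unknown code: surface it rather than guess


for code in (1000, 1006, 4001):
    print(code, close_action(code))
```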

**Reconnect with exponential backoff** (pseudocode):

```text
attempt = 0
while not connected and attempt < MAX_ATTEMPTS:
    try:
        open WebSocket, send config, resume sending text
        attempt = 0                      # reset on success
    except (close 1006 / 1011 / network error):
        delay = min(BASE * 2 ** attempt, MAX_DELAY)   # e.g. 0.5s, 1s, 2s, 4s ... capped
        sleep(delay + small random jitter)
        attempt += 1
    on close 4xxx (auth/quota): stop and surface the error  # do not blind-retry
```

Don't auto-retry on `4xxx` auth/quota closes — fix the cause first (see [Errors & Troubleshooting](/api-reference-docs/errors-troubleshooting)).
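The backoff loop could be written in Python roughly as follows. This is a sketch under stated assumptions: `connect` is a hypothetical coroutine that opens the socket and sends the config, transient failures are modelled with the built-in `ConnectionError`, and any other exception (for example an auth failure) propagates without a retry.

```python
import asyncio
import random


def backoff_delay(attempt, base=0.5, max_delay=8.0):
    """Exponential delay in seconds, capped at `max_delay`."""
    return min(base * 2 ** attempt, max_delay)


async def connect_with_backoff(connect, max_attempts=5, sleep=asyncio.sleep):
    """Call `connect()` until it succeeds; retry only on ConnectionError."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = backoff_delay(attempt)
            # Small random jitter avoids many clients retrying in lockstep.
            await sleep(delay + random.uniform(0, 0.1 * delay))


async def demo():
    attempts = []

    async def flaky_connect():  # stand-in that fails twice, then succeeds
        attempts.append(1)
        if len(attempts) < 3:
            raise ConnectionError("simulated network drop")
        return "connected"

    async def no_sleep(_delay):  # skip real waiting in the demo
        return None

    return await connect_with_backoff(flaky_connect, sleep=no_sleep)


print(asyncio.run(demo()))
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.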

## Voice-Agent Barge-In

When a user interrupts the agent mid-reply, playback should stop immediately. The TTS WebSocket has **no server-side cancel/clear message**: the client can only send `config`, `text` (`convert`), `flush`, and `ping` messages, or close the connection. Handle barge-in entirely on the client:

1. **Stop playback locally** the moment your STT detects `speech_start` — flush/clear your local audio buffer and stop the player.
2. **Close the TTS socket** for the interrupted utterance so the server stops generating (any in-flight chunks are simply discarded on your side).
3. **Open a fresh connection** for the agent's next reply.

```python
# In your STT speech_start handler (see the STT WebSocket barge-in recipe):
tts_player.stop()        # 1. stop and clear local audio playback immediately
await tts_ws.close()     # 2. close the TTS socket to stop further generation
# 3. open a new client.text_to_speech_streaming.connect(...) for the next turn
```
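Chunks that were already in flight when the socket closed may still reach your receive loop. One way to discard them is to tag each reply with a turn id and drop anything from an older turn. The `PlaybackGate` class below is a hypothetical sketch of that idea, not part of the SDK.

```python
class PlaybackGate:
    """Drops audio chunks that belong to an interrupted utterance."""

    def __init__(self):
        self.current_turn = 0
        self.buffer = []

    def start_turn(self):
        """Begin a new agent reply and return its turn id."""
        self.current_turn += 1
        self.buffer.clear()
        return self.current_turn

    def barge_in(self):
        """User interrupted: clear queued audio and invalidate the active turn."""
        self.buffer.clear()
        self.current_turn += 1

    def accept(self, turn_id, chunk):
        """Queue a chunk only if it belongs to the active turn."""
        if turn_id != self.current_turn:
            return False
        self.buffer.append(chunk)
        return True


gate = PlaybackGate()
turn = gate.start_turn()
gate.accept(turn, b"first chunk")
gate.barge_in()                          # user starts speaking
print(gate.accept(turn, b"late chunk"))  # stale chunk is rejected
```

The receive loop for each connection would capture its own turn id and pass it to `accept()` along with every decoded chunk.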

Because there's no in-band cancel, keep TTS replies chunked into shorter `convert()` calls so a barge-in discards less already-generated audio. See the [STT WebSocket barge-in recipe](/api-reference-docs/speech-to-text/apis/streaming), the LiveKit and Pipecat voice-agent guides, and [Credits & Rate Limits](/api-reference-docs/ratelimits) for streaming concurrency limits.