
# HTTP Streaming API

> Stream TTS audio over a single HTTP POST request. No WebSocket setup, no connection management — just POST text and pipe the audio response.

`POST /text-to-speech/stream` takes text in and returns a binary audio stream. The response starts arriving as soon as the first audio chunk is ready, so you can begin playback or piping without waiting for the full file.

Unlike the WebSocket API, there is no handshake, config message, or connection lifecycle to manage: one HTTP request, one streamed response.

**Common use cases:**

* **Backend audio generation** — Pipe audio directly to a file, S3, or a downstream service
* **API proxying** — Forward the stream to your frontend or mobile client as-is
* **Batch processing** — Generate audio for a queue of texts using simple HTTP calls
* **Serverless / edge** — Works in any environment that supports HTTP — no WebSocket runtime needed

***

## HTTP Stream vs WebSocket — When to Use Which

Both give you streaming audio. The difference is how much control you need.

|                      | HTTP Stream                                                     | WebSocket                                                |
| -------------------- | --------------------------------------------------------------- | -------------------------------------------------------- |
| **Protocol**         | Single `POST` request                                           | Persistent bidirectional connection                      |
| **Setup**            | Zero — it's a normal HTTP call                                  | Handshake + config message before first text             |
| **Endpoint**         | `/text-to-speech/stream`                                        | `/text-to-speech/ws`                                     |
| **Text input**       | One text payload per request                                    | Send multiple texts on the same connection               |
| **Max text**         | 3500 characters                                                 | 2500 characters per message (send many)                  |
| **Audio output**     | Binary stream (play/save directly)                              | Base64-encoded chunks (decode each one)                  |
| **Connection reuse** | New connection per request                                      | One connection, many conversions                         |
| **Best for**         | One-shot generation, server-side pipelines, simple integrations | Voice agents, interactive apps, multi-turn conversations |

**Use HTTP Stream when:**

* You have a complete text and just need audio back
* You're generating audio server-side (batch jobs, API endpoints, CI pipelines)
* Your runtime doesn't support WebSocket (serverless functions, edge workers)
* You want the simplest possible integration — `curl` works out of the box

**Use WebSocket when:**

* You're building a conversational agent that streams text incrementally (e.g., from an LLM)
* You need to send multiple texts without reconnecting
* Low time-to-first-byte on successive utterances matters (connection is already warm)
* You need fine-grained control over buffering and flushing

***

## Code Examples

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

chunks = []
for chunk in client.text_to_speech.convert_stream(
    text="नमस्ते! Sarvam AI में आपका स्वागत है। हम India की हर language को voice देते हैं।",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    output_audio_codec="mp3",
):
    chunks.append(chunk)

audio = b"".join(chunks)
with open("output.mp3", "wb") as f:
    f.write(audio)
print(f"Saved output.mp3 ({len(audio)} bytes)")
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const client = new SarvamAIClient({
  apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
});

const response = await client.textToSpeech.convertStream({
  text: "नमस्ते! Sarvam AI में आपका स्वागत है। हम India की हर language को voice देते हैं।",
  target_language_code: "hi-IN",
  speaker: "shubh",
  model: "bulbul:v3",
  output_audio_codec: "mp3",
});

const audio = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", audio);
console.log(`Saved output.mp3 (${audio.length} bytes)`);
```

```bash
curl -X POST https://api.sarvam.ai/text-to-speech/stream \
  -H "api-subscription-key: YOUR_SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  --output output.mp3 \
  -d '{
    "text": "नमस्ते! Sarvam AI में आपका स्वागत है। हम India की हर language को voice देते हैं।",
    "target_language_code": "hi-IN",
    "speaker": "shubh",
    "model": "bulbul:v3",
    "output_audio_codec": "mp3"
  }'

# output.mp3 is ready to play
```
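For the batch-processing use case, the same call can be wrapped in a small loop that writes one file per text. The following is a minimal sketch, not part of the SDK: `synthesize_batch`, the `synthesize` callable, and the `clip_0000.mp3` naming scheme are hypothetical, and the `.mp3` extension assumes `output_audio_codec="mp3"`.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def synthesize_batch(
    texts: Iterable[str],
    synthesize: Callable[[str], Iterator[bytes]],
    out_dir: str,
) -> List[Path]:
    """Write one audio file per input text and return the paths in input order.

    `synthesize` is a hypothetical callable that wraps the streaming call:
    it takes a text and yields binary audio chunks.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    for index, text in enumerate(texts):
        path = target / f"clip_{index:04d}.mp3"  # assumed naming scheme
        with open(path, "wb") as f:
            for chunk in synthesize(text):
                f.write(chunk)  # write each chunk as it arrives
        paths.append(path)
    return paths
```

With the SDK client from the examples above, `synthesize` could be `lambda t: client.text_to_speech.convert_stream(text=t, target_language_code="hi-IN", speaker="shubh", model="bulbul:v3", output_audio_codec="mp3")`. Requests run one after another here; whether parallel requests are appropriate depends on your rate limits.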

***

## Piping the Stream

Since the response is a raw binary audio stream, you can pipe it directly without buffering the whole file in memory.

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("output.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है।",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
```

```bash
curl -sN -X POST https://api.sarvam.ai/text-to-speech/stream \
  -H "api-subscription-key: YOUR_SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a live playback test from Sarvam TTS.",
    "target_language_code": "en-IN",
    "speaker": "shubh",
    "model": "bulbul:v3",
    "output_audio_codec": "mp3"
  }' | ffplay -nodisp -autoexit -
```
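The file-writing loop generalizes to any writable binary sink, such as a socket file, an in-memory buffer, or a subprocess's stdin. Below is a rough sketch under that assumption; `pipe_chunks` is a hypothetical helper, and the demo uses stand-in bytes rather than real audio.

```python
import io
from typing import BinaryIO, Iterable


def pipe_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write chunks to `sink` as they arrive and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive style chunks
        sink.write(chunk)
        total += len(chunk)
    sink.flush()
    return total


# Demo with stand-in chunks instead of a live stream.
buffer = io.BytesIO()
written = pipe_chunks([b"ID3", b"", b"\x00\x01"], buffer)
print(written)  # 5
```

To mirror the `ffplay` pipeline in Python, the sink could be the `stdin` of `subprocess.Popen(["ffplay", "-nodisp", "-autoexit", "-"], stdin=subprocess.PIPE)`, closed once the stream ends.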

***

## Request Parameters

| Parameter                 | Type    | Required | Default     | Description                                                                                                               |
| ------------------------- | ------- | -------- | ----------- | ------------------------------------------------------------------------------------------------------------------------- |
| `text`                    | string  | Yes      | —           | Text to convert. Max 3500 characters. Supports code-mixed text.                                                           |
| `target_language_code`    | string  | No       | `en-IN`     | BCP-47 language code (`hi-IN`, `ta-IN`, `en-IN`, etc.)                                                                    |
| `speaker`                 | string  | No       | `shubh`     | Speaker voice. See [voice list](/api-reference-docs/api-guides-tutorials/text-to-speech/rest-api).                        |
| `model`                   | string  | No       | `bulbul:v2` | `bulbul:v3` (recommended) or `bulbul:v2`                                                                                  |
| `output_audio_codec`      | string  | No       | `mp3`       | `mp3`, `wav`, `aac`, `opus`, `flac`, `linear16`, `mulaw`, `alaw`                                                          |
| `output_audio_bitrate`    | string  | No       | `128k`      | `32k`, `64k`, `128k`, `192k`, `256k`                                                                                      |
| `pace`                    | number  | No       | `1.0`       | Speech speed. v3: `0.5`–`2.0`, v2: `0.3`–`3.0`                                                                            |
| `speech_sample_rate`      | number  | No       | `22050`     | Output sample rate in Hz                                                                                                  |
| `temperature`             | number  | No       | `0.6`       | Expressiveness. `0.01`–`1.0`. v3 only.                                                                                    |
| `dict_id`                 | string  | No       | —           | [Pronunciation dictionary](/api-reference-docs/api-guides-tutorials/text-to-speech/pronunciation-dictionary) ID. v3 only. |
| `enable_preprocessing`    | boolean | No       | `false`     | Normalize English words and numbers before synthesis                                                                      |
| `enable_cached_responses` | boolean | No       | `false`     | Enable response caching (beta)                                                                                            |
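Because `text` is capped at 3500 characters, longer input has to be split across several requests. One possible approach is sketched below: `split_text` is a hypothetical helper that prefers sentence boundaries (`.`, `!`, `?`, and the danda `।`) and hard-cuts only when a single sentence exceeds the limit. It assumes the limit is counted the way Python's `len` counts (Unicode code points), which the table does not specify.

```python
import re
from typing import List

MAX_CHARS = 3500  # per-request limit from the parameter table


def split_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most `limit` characters each."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Each match is one sentence plus its trailing whitespace, or the final tail.
    sentences = re.findall(r".+?(?:[.!?।]+\s*|$)", text, flags=re.DOTALL)
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) <= limit:
            current += sentence
            continue
        if current.strip():
            pieces.append(current.strip())
        # Hard-cut a sentence that cannot fit in a single request.
        while len(sentence) > limit:
            pieces.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current.strip():
        pieces.append(current.strip())
    return pieces


print(split_text("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

Each piece can then be sent as its own request and the audio concatenated or played in order. A hard cut may land mid-word or between a base character and its combining mark, so it is worth keeping sentences well under the limit where possible.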

***

## Response

The response is a **binary audio stream** — not JSON, not base64. The `Content-Type` header matches your requested codec (e.g., `audio/mpeg` for MP3).

You can:

* Save it directly to a file (`--output` in cURL, `f.write(chunk)` in Python)
* Pipe it to an audio player
* Forward it to a client as a streaming HTTP response

This differs from the REST endpoint (`/text-to-speech`), which returns base64-encoded audio inside a JSON response. The stream endpoint returns raw binary audio, so no decoding is needed.
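If you call the endpoint with a raw HTTP client rather than the SDK, it can help to confirm that a response is audio before writing or forwarding it. The check below is a minimal sketch: `is_audio_response` is a hypothetical helper, and it assumes every codec is served with an `audio/*` media type. Only `audio/mpeg` for MP3 is confirmed above.

```python
def is_audio_response(status_code: int, content_type: str) -> bool:
    """Return True when a response looks like audio rather than a JSON error."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status_code == 200 and media_type.startswith("audio/")


print(is_audio_response(200, "audio/mpeg"))        # True
print(is_audio_response(422, "application/json"))  # False
```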

***

## With Pronunciation Dictionary

Pass `dict_id` to apply custom pronunciations during streaming synthesis:

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("output.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="NEFT transfer karein aur KYC complete karein",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        dict_id="p_5cb7faa6",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const client = new SarvamAIClient({
  apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
});

const response = await client.textToSpeech.convertStream({
  text: "NEFT transfer karein aur KYC complete karein",
  target_language_code: "hi-IN",
  speaker: "shubh",
  model: "bulbul:v3",
  dict_id: "p_5cb7faa6",
  output_audio_codec: "mp3",
});

const audio = Buffer.from(await response.arrayBuffer());
fs.writeFileSync("output.mp3", audio);
```

```bash
curl -X POST https://api.sarvam.ai/text-to-speech/stream \
  -H "api-subscription-key: YOUR_SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  --output output.mp3 \
  -d '{
    "text": "NEFT transfer karein aur KYC complete karein",
    "target_language_code": "hi-IN",
    "speaker": "shubh",
    "model": "bulbul:v3",
    "dict_id": "p_5cb7faa6",
    "output_audio_codec": "mp3"
  }'
```

See the [Pronunciation Dictionary guide](/api-reference-docs/api-guides-tutorials/text-to-speech/pronunciation-dictionary) for setup.

***

## Error Handling

Errors return JSON (not audio) with the standard error format:

```json
{
  "error": {
    "message": "Text exceeds maximum length of 3500 characters",
    "code": "unprocessable_entity_error"
  }
}
```
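When handling raw responses yourself, the error body above can be unpacked into its `code` and `message` fields. This is a small sketch with a hypothetical `parse_error` helper; it falls back to the raw text if the body is not the expected JSON shape (for example, an error page from an intermediate proxy).

```python
import json
from typing import Optional, Tuple


def parse_error(body: bytes) -> Tuple[Optional[str], str]:
    """Return (code, message) from an error body, tolerating non-JSON input."""
    try:
        payload = json.loads(body.decode("utf-8"))
        error = payload["error"]
        return error.get("code"), error.get("message", "")
    except (ValueError, KeyError, TypeError, AttributeError):
        # Not the documented shape: return the raw text with no code.
        return None, body.decode("utf-8", errors="replace")


sample = b'{"error": {"message": "Text exceeds maximum length of 3500 characters", "code": "unprocessable_entity_error"}}'
print(parse_error(sample)[0])  # unprocessable_entity_error
```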

| HTTP Status | Error Code                   | When                                 |
| ----------- | ---------------------------- | ------------------------------------ |
| `400`       | `invalid_request_error`      | Missing or malformed parameters      |
| `403`       | `invalid_api_key_error`      | Invalid or missing API key           |
| `422`       | `unprocessable_entity_error` | Text too long, invalid speaker/model |
| `429`       | `insufficient_quota_error`   | Rate limit or quota exceeded         |
| `500`       | `internal_server_error`      | Server error — retry                 |

```python
from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    for chunk in client.text_to_speech.convert_stream(
        text="Hello from Sarvam AI!",
        target_language_code="en-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        pass  # process chunk
except ApiError as e:
    print(f"Error {e.status_code}: {e.body}")
```

```javascript
import { SarvamAIClient } from "sarvamai";

const client = new SarvamAIClient({
  apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
});

try {
  const response = await client.textToSpeech.convertStream({
    text: "Hello from Sarvam AI!",
    target_language_code: "en-IN",
    speaker: "shubh",
    model: "bulbul:v3",
    output_audio_codec: "mp3",
  });

  const audio = Buffer.from(await response.arrayBuffer());
  // process audio...
} catch (error) {
  // Serialize the body: interpolating an object directly prints "[object Object]".
  console.error(`Error ${error.statusCode}: ${JSON.stringify(error.body)}`);
}
```

```bash
# Prints the HTTP status code. 200 means output.mp3 holds audio;
# on any other status, output.mp3 contains the JSON error body instead.
curl -s -o output.mp3 -w "%{http_code}" -X POST \
  https://api.sarvam.ai/text-to-speech/stream \
  -H "api-subscription-key: YOUR_SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from Sarvam AI!",
    "target_language_code": "en-IN",
    "model": "bulbul:v3",
    "output_audio_codec": "mp3"
  }'
```
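The table marks `500` as retryable. A retry wrapper with exponential backoff might look like the sketch below. Everything in it is an assumption rather than SDK behavior: `HttpError` is a stand-in error type, `with_retries` is a hypothetical helper, and treating `429` as retryable only helps when the cause is a rate limit, not an exhausted quota.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")
RETRYABLE = {429, 500}  # assumption: 429 is retried only in case it is a rate limit


class HttpError(Exception):
    """Hypothetical stand-in for an HTTP client's error type."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying retryable statuses with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except HttpError as exc:
            is_last = attempt == attempts - 1
            if exc.status_code not in RETRYABLE or is_last:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

With the SDK, the same pattern would catch the `ApiError` shown in the Python example above and read its `status_code`. Since a stream can fail partway through, `call` should perform the whole request and discard any partial output before a retry.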

Full endpoint spec with all parameters and error details is in the [API Reference](/api-reference-docs/text-to-speech/convert-stream).

Need help? Reach out on [Discord](https://discord.com/invite/5rAsykttcs).