# Which Text-to-Speech API to Use

> Compare Sarvam's Text-to-Speech APIs — REST, HTTP streaming, and WebSocket — and pick the right one for your latency, text length, and interactivity needs.

Sarvam gives you three ways to synthesize speech from the same Bulbul voices: the **REST**, **HTTP streaming**, and **WebSocket** APIs. They produce identical audio quality — the difference is **how the audio reaches you** (one JSON blob, a binary stream, or incremental chunks) and **how interactive** the connection is. Use this page to pick one before you start integrating.

## Quick decision

* **REST**: short text, and you want the full audio file back in one call.
* **HTTP Stream**: start playback or saving before the whole clip is synthesized, with one POST and no WebSocket.
* **WebSocket**: voice agents streaming text from an LLM, with many utterances on one connection.

## Comparison

|                         | REST                                    | HTTP Stream                                      | WebSocket                                          |
| ----------------------- | --------------------------------------- | ------------------------------------------------ | -------------------------------------------------- |
| **Endpoint**            | `POST /text-to-speech`                  | `POST /text-to-speech/stream`                    | `GET /text-to-speech/ws`                           |
| **Connection**          | Single request/response                 | Single streamed response                         | Persistent, bidirectional                          |
| **Setup**               | None                                    | None                                             | Handshake + config message                         |
| **Max text**            | 2500 characters                         | 3500 characters                                  | 2500 characters per message (send many)            |
| **Audio output**        | Base64 audio inside JSON (decode once)  | Raw binary stream (play/save directly)           | Base64 chunks (decode each)                        |
| **Time-to-first-audio** | After full synthesis                    | Low — first chunk streams early                  | Lowest on a warm connection                        |
| **Utterances**          | One per request                         | One per request                                  | Many per connection                                |
| **Connection reuse**    | New request each time                   | New request each time                            | One connection, many conversions                   |
| **Best for**            | Short, fixed text; simplest integration | Server-side pipelines, proxying, edge/serverless | Interactive voice agents, multi-turn conversations |
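
The table can be read as a small decision rule. The sketch below is one possible encoding of it, not an official selector: `choose_tts_api` and its parameter names are hypothetical, and sending text longer than 3500 characters to WebSocket (split across several messages) is an assumption based on the "send many" note in the table.

```python
# Character limits copied from the comparison table.
REST_MAX_CHARS = 2500
HTTP_STREAM_MAX_CHARS = 3500


def choose_tts_api(
    text_length: int,
    needs_early_audio: bool = False,
    multi_utterance: bool = False,
) -> str:
    """Hypothetical helper: map requirements to "rest", "http-stream", or "websocket"."""
    if text_length < 0:
        raise ValueError("text_length must be non-negative")
    # Many utterances on one connection is the WebSocket use case.
    if multi_utterance:
        return "websocket"
    # Early playback without a persistent connection points to HTTP Stream,
    # as long as the text fits in a single request.
    if needs_early_audio and text_length <= HTTP_STREAM_MAX_CHARS:
        return "http-stream"
    # Short text with no early-audio requirement: simplest integration.
    if not needs_early_audio and text_length <= REST_MAX_CHARS:
        return "rest"
    # Between the REST and HTTP Stream limits, only HTTP Stream takes one payload.
    if text_length <= HTTP_STREAM_MAX_CHARS:
        return "http-stream"
    # Assumption: longer text is split into several WebSocket messages.
    return "websocket"
```

For example, `choose_tts_api(3000)` picks HTTP Stream because the text exceeds the REST limit but still fits in one streamed request.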

## When to use each

**REST — `POST /text-to-speech`**

* The text is short (≤2500 characters) and known up front.
* You're fine receiving the complete audio after synthesis finishes.
* Returns **base64 audio inside a JSON response** — decode it once and save/play.
* [REST API guide →](/api-reference-docs/api-guides-tutorials/text-to-speech/rest-api)
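
The "decode it once" step might look like the sketch below. It works on a response body you have already received and makes no network call. The field name `audios` (a list of base64 strings) is an assumption about the response shape; check the REST API guide for the actual schema. `decode_rest_audio` is a hypothetical helper name.

```python
import base64
import json


def decode_rest_audio(response_body: str, field: str = "audios") -> bytes:
    """Decode base64 audio from a JSON response body.

    `field` is an assumed key name; adjust it to match the real response.
    """
    payload = json.loads(response_body)
    encoded = payload[field]
    # Assumption: the field may hold a list of clips; take the first one.
    if isinstance(encoded, list):
        encoded = encoded[0]
    return base64.b64decode(encoded)
```

The returned bytes can be written to a file opened in binary mode (`"wb"`) or handed to an audio player.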

**HTTP Stream — `POST /text-to-speech/stream`**

* You want playback or saving to begin before the full clip is ready, but don't need a persistent connection.
* Ideal for server-side generation, proxying the stream to a client, or runtimes without WebSocket support (serverless, edge).
* Returns a **raw binary audio stream** — pipe it straight to a file or player, no decoding.
* Accepts the longest single payload (up to 3500 characters).
* [HTTP Streaming guide →](/api-reference-docs/api-guides-tutorials/text-to-speech/streaming-api/http-stream)
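
Because the stream is raw binary, consuming it is mostly a copy loop. The sketch below assumes you already have an iterable of byte chunks from your HTTP client; `save_audio_stream` is a hypothetical helper, and the demo uses made-up bytes rather than real audio.

```python
import io
from typing import BinaryIO, Iterable


def save_audio_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write streamed audio chunks to `sink` as they arrive; return bytes written."""
    total = 0
    for chunk in chunks:
        # Some clients yield empty keep-alive chunks; skip them.
        if not chunk:
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # Demo with an in-memory sink and placeholder chunks.
    buffer = io.BytesIO()
    written = save_audio_stream([b"\x00\x01", b"", b"\x02"], buffer)
    print(f"wrote {written} bytes")
```

With the `requests` library, for instance, `response.iter_content(chunk_size=4096)` on a request made with `stream=True` yields chunks of this shape, so the same loop can feed a file or a player as data arrives.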

**WebSocket — `GET /text-to-speech/ws`**

* You're building a conversational agent that streams text incrementally (e.g. token-by-token from an LLM).
* You need to send many utterances without reconnecting, keeping time-to-first-audio low on successive turns.
* You want fine-grained control over buffering and flushing.
* [WebSocket guide →](/api-reference-docs/api-guides-tutorials/text-to-speech/streaming-api/web-socket)
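
Since each WebSocket message is capped at 2500 characters, longer text has to be split before sending. The sketch below is one way to do that at sentence boundaries; `split_for_websocket` is a hypothetical helper, and the sentence regex (which also treats the danda `।` as a terminator) is a simplifying assumption rather than anything the API prescribes.

```python
import re

# Per-message limit from the comparison table.
WS_MAX_CHARS = 2500


def split_for_websocket(text: str, limit: int = WS_MAX_CHARS) -> list[str]:
    """Split text into messages of at most `limit` characters, preferring sentence breaks."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    messages: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                messages.append(current)
                current = ""
            messages.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            messages.append(current)
            current = sentence
    if current:
        messages.append(current)
    return messages
```

Each returned string can then be sent as its own text message on the open connection, which keeps every payload within the per-message limit.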

Both streaming transports start returning audio before the full clip is synthesized. The difference is **control**: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth [HTTP Stream vs WebSocket](/api-reference-docs/api-guides-tutorials/text-to-speech/streaming-api/http-stream) comparison.

## Related

* [Text-to-Speech overview](/api-reference-docs/api-guides-tutorials/text-to-speech/overview)
* [Supported audio formats & MIME types](/api-reference-docs/api-guides-tutorials/text-to-speech/overview)
* [TTS best practices](/api-reference-docs/api-guides-tutorials/text-to-speech/best-practices)
* [Credits & Rate Limits](/api-reference-docs/ratelimits)