Which Text-to-Speech API to Use

Sarvam gives you three ways to synthesize speech from the same Bulbul voices: the REST, HTTP Stream, and WebSocket APIs. All three produce identical audio quality. They differ in how the audio reaches you (one JSON blob, a binary stream, or incremental chunks) and in how interactive the connection is. Use this page to pick one before you start integrating.

Quick decision

REST

Short text, and you want the full audio file back in one call.

HTTP Stream

Start playback/saving before the whole clip is synthesized — one POST, no WebSocket.

WebSocket

Voice agents streaming text from an LLM, many utterances on one connection.

Comparison

	REST	HTTP Stream	WebSocket
Endpoint	`POST /text-to-speech`	`POST /text-to-speech/stream`	`GET /text-to-speech/ws`
Connection	Single request/response	Single streamed response	Persistent, bidirectional
Setup	None	None	Handshake + config message
Max text	2500 characters	3500 characters	2500 characters per message (send many)
Audio output	Base64 audio inside JSON (decode once)	Raw binary stream (play/save directly)	Base64 chunks (decode each)
Time-to-first-audio	After full synthesis	Low — first chunk streams early	Lowest on a warm connection
Utterances	One per request	One per request	Many per connection
Connection reuse	New request each time	New request each time	One connection, many conversions
Best for	Short, fixed text; simplest integration	Server-side pipelines, proxying, edge/serverless	Interactive voice agents, multi-turn conversations
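
The table can be read as a small decision procedure. The sketch below is one possible encoding of it, not an official helper: the function name, its parameters, and the returned labels are hypothetical, and only the character limits come from the table. Sending text longer than 3500 characters over WebSocket is an assumption based on the "send many" note; splitting it across several REST or HTTP Stream requests would be another option.

```python
REST_MAX_CHARS = 2500         # "Max text" row, REST column
HTTP_STREAM_MAX_CHARS = 3500  # "Max text" row, HTTP Stream column


def choose_tts_api(text_length: int,
                   incremental_text: bool = False,
                   many_utterances: bool = False,
                   early_playback: bool = False) -> str:
    """Return "rest", "http_stream", or "websocket" (hypothetical labels)."""
    # Incremental text or many utterances on one connection point to WebSocket.
    if incremental_text or many_utterances:
        return "websocket"
    # Early playback without a persistent connection points to HTTP Stream,
    # as long as the text fits in a single request.
    if early_playback and text_length <= HTTP_STREAM_MAX_CHARS:
        return "http_stream"
    # Short, fixed text with no early playback: REST is the simplest option.
    if not early_playback and text_length <= REST_MAX_CHARS:
        return "rest"
    # Too long for REST but still within one HTTP Stream request.
    if text_length <= HTTP_STREAM_MAX_CHARS:
        return "http_stream"
    # Assumption: longer text is split into several WebSocket messages.
    return "websocket"
```

For example, `choose_tts_api(3000)` selects HTTP Stream because the text exceeds the REST limit but fits in one streamed request.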

When to use each

REST — POST /text-to-speech

The text is short (≤2500 characters) and known up front.
You’re fine receiving the complete audio after synthesis finishes.
Returns base64 audio inside a JSON response — decode it once and save/play.
REST API guide →
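
The "decode it once" step might look like the following minimal sketch. It assumes the JSON response carries the base64 audio under a field called `audios`, either as a string or as a list whose first entry is the clip; that field name and shape are assumptions here, so check the REST API guide for the actual response schema.

```python
import base64
import json


def decode_rest_audio(response_body: str, audio_field: str = "audios") -> bytes:
    """Extract and decode the base64 audio from a REST response body.

    `audio_field` is an assumed field name, not confirmed by this page.
    """
    payload = json.loads(response_body)
    encoded = payload[audio_field]
    # Accept either a single base64 string or a list holding one.
    if isinstance(encoded, list):
        encoded = encoded[0]
    return base64.b64decode(encoded)


# Demo with a stand-in response body instead of a real API call.
fake_body = json.dumps(
    {"audios": [base64.b64encode(b"fake-audio-bytes").decode("ascii")]}
)
audio_bytes = decode_rest_audio(fake_body)
```

The returned bytes can then be written to a file opened in binary mode or handed to an audio player.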

HTTP Stream — POST /text-to-speech/stream

You want playback or saving to begin before the full clip is ready, but don’t need a persistent connection.
Ideal for server-side generation, proxying the stream to a client, or runtimes without WebSocket support (serverless, edge).
Returns a raw binary audio stream — pipe it straight to a file or player, no decoding.
Accepts the longest single payload (up to 3500 characters).
HTTP Streaming guide →
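
Because the response is raw binary, consuming it can be as simple as copying chunks to a file or player as they arrive. The sketch below shows only that copy loop; `save_audio_stream` is a hypothetical helper, and the demo uses in-memory stand-ins rather than a real request.

```python
import io
from typing import BinaryIO, Iterable


def save_audio_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each audio chunk to `sink` as it arrives; return bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        sink.write(chunk)  # no decoding step: the stream is already binary
        total += len(chunk)
    return total


# Demo with in-memory stand-ins for the network stream and the output file.
buffer = io.BytesIO()
written = save_audio_stream([b"abc", b"", b"de"], buffer)
```

With the `requests` library, the chunks could come from `response.iter_content(chunk_size=8192)` on a `requests.post(..., stream=True)` call; the full URL, headers, and request body are covered in the HTTP Streaming guide and are not assumed here.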

WebSocket — GET /text-to-speech/ws

You’re building a conversational agent that streams text incrementally (e.g. token-by-token from an LLM).
You need to send many utterances without reconnecting, keeping time-to-first-audio low on successive turns.
You want fine-grained control over buffering and flushing.
WebSocket guide →
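
Since each WebSocket message is limited to 2500 characters, longer text has to be split before sending. The following is a minimal sketch of one way to do that at word boundaries. The helper name is hypothetical, and the message envelope and flush behavior are left to the WebSocket guide.

```python
from typing import List


def split_into_messages(text: str, limit: int = 2500) -> List[str]:
    """Split text into pieces of at most `limit` characters, on whitespace.

    Runs of whitespace are collapsed to single spaces. A single word longer
    than `limit` is cut into fixed-size pieces.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    messages: List[str] = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in one message by itself.
        while len(word) > limit:
            if current:
                messages.append(current)
                current = ""
            messages.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            messages.append(current)
            current = word
    if current:
        messages.append(current)
    return messages
```

Each returned string can be sent as its own message on the open connection, so a long passage still uses a single handshake.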

Both streaming transports start returning audio before the full clip is synthesized. The difference is control: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth HTTP Stream vs WebSocket comparison.
