Which Text-to-Speech API to Use
Sarvam offers three ways to synthesize speech with the same Bulbul voices: the REST, HTTP streaming, and WebSocket APIs. All three produce identical audio quality. They differ in how the audio reaches you (one JSON blob, a binary stream, or incremental chunks) and in how interactive the connection is. Use this page to pick one before you start integrating.
Quick decision
- REST: short text, and you want the full audio file back in one call.
- HTTP Stream: start playing or saving before the whole clip is synthesized, using one POST and no WebSocket.
- WebSocket: voice agents streaming text from an LLM, with many utterances on one connection.
Comparison
When to use each
REST — POST /text-to-speech
- The text is short (≤2500 characters) and known up front.
- You’re fine receiving the complete audio after synthesis finishes.
- The response is JSON containing base64-encoded audio: decode it once, then save or play it.
- REST API guide →
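The REST flow described above (one POST, then one base64 decode) might look like the following minimal sketch. The host `api.sarvam.ai`, the `api-subscription-key` header, the `target_language_code` body field, and the `audios` response field are assumptions made for illustration; check the REST API guide for the actual names. Only the `/text-to-speech` path and the 2500-character limit come from this page.

```python
import base64
import json
import urllib.request

# Assumed host; only the /text-to-speech path is stated on this page.
API_URL = "https://api.sarvam.ai/text-to-speech"
REST_MAX_CHARS = 2500  # limit stated on this page


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build the POST request. Nothing is sent until urlopen() is called."""
    if len(text) > REST_MAX_CHARS:
        raise ValueError(f"REST accepts at most {REST_MAX_CHARS} characters")
    # Body fields and header name are assumptions; see the REST API guide.
    body = json.dumps({"text": text, "target_language_code": "hi-IN"}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-subscription-key": api_key}
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def decode_audio(payload: dict, field: str = "audios") -> bytes:
    """Decode base64 audio from the parsed JSON response.

    The field name is hypothetical. If the field holds a list,
    the first entry is decoded.
    """
    value = payload[field]
    encoded = value[0] if isinstance(value, list) else value
    return base64.b64decode(encoded)


def synthesize_to_file(text: str, api_key: str, path: str) -> int:
    """Send the request, decode the audio once, and save it. Returns bytes written."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        payload = json.load(resp)
    audio = decode_audio(payload)
    with open(path, "wb") as out:
        out.write(audio)
    return len(audio)
```

Because the whole clip arrives inside one JSON document, nothing can be played until the response has been fully received and decoded, which is the trade-off the streaming transports address.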
HTTP Stream — POST /text-to-speech/stream
- You want playback or saving to begin before the full clip is ready, but don’t need a persistent connection.
- Ideal for server-side generation, proxying the stream to a client, or runtimes without WebSocket support (serverless, edge).
- The response is a raw binary audio stream: pipe it straight to a file or player, with no base64 decoding step.
- Accepts the longest single payload (up to 3500 characters).
- HTTP Streaming guide →
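Piping the binary stream to a file as it arrives could be sketched as below. As before, the host, header name, and body field are assumptions for illustration; the `/text-to-speech/stream` path and the 3500-character limit are taken from this page. The copy loop works on any file-like object, so it can be tried without a network call.

```python
import io
import json
import urllib.request
from typing import BinaryIO, Iterator

# Assumed host; only the /text-to-speech/stream path is stated on this page.
STREAM_URL = "https://api.sarvam.ai/text-to-speech/stream"
STREAM_MAX_CHARS = 3500  # limit stated on this page


def iter_chunks(source: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield chunks from a file-like object until it is exhausted."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


def pipe_audio(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy audio bytes to the sink as they arrive. Returns bytes written."""
    total = 0
    for chunk in iter_chunks(source, chunk_size):
        sink.write(chunk)  # raw audio bytes, no decoding needed
        total += len(chunk)
    return total


def stream_to_file(text: str, api_key: str, path: str) -> int:
    """POST the text and write the streamed audio to a file."""
    if len(text) > STREAM_MAX_CHARS:
        raise ValueError(f"HTTP Stream accepts at most {STREAM_MAX_CHARS} characters")
    # Body field and header name are assumptions; see the HTTP Streaming guide.
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-subscription-key": api_key}
    req = urllib.request.Request(STREAM_URL, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        return pipe_audio(resp, out)
```

Replacing the file sink with a player's input or an outgoing HTTP response gives the proxying pattern mentioned above.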
WebSocket — GET /text-to-speech/ws
- You’re building a conversational agent that streams text incrementally (e.g. token-by-token from an LLM).
- You need to send many utterances without reconnecting, keeping time-to-first-audio low on successive turns.
- You want fine-grained control over buffering and flushing.
- WebSocket guide →
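To illustrate what "control over buffering and flushing" can mean on the client side, here is a transport-agnostic sketch that groups LLM tokens into sentence-sized messages. The message shapes (`{"type": "text", ...}` and `{"type": "flush"}`) and the `min_chars` threshold are hypothetical and chosen only for illustration; the WebSocket guide defines the real protocol. The generator produces JSON strings that a WebSocket client of your choice would send.

```python
import json
from typing import Iterable, Iterator

# Sentence terminators, including the Devanagari danda.
SENTENCE_END = (".", "?", "!", "।")


def to_ws_messages(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Turn an LLM token stream into hypothetical text/flush messages.

    Tokens accumulate until the buffer holds at least min_chars characters
    and ends on a sentence boundary; then the text is emitted, followed by
    a flush asking the server to synthesize what it has.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars and buffer.rstrip().endswith(SENTENCE_END):
            # Hypothetical message shape; confirm against the WebSocket guide.
            yield json.dumps({"type": "text", "data": {"text": buffer}}, ensure_ascii=False)
            yield json.dumps({"type": "flush"})
            buffer = ""
    if buffer.strip():
        # Send whatever remains when the LLM stops generating.
        yield json.dumps({"type": "text", "data": {"text": buffer}}, ensure_ascii=False)
        yield json.dumps({"type": "flush"})
```

Flushing on sentence boundaries is one possible policy: a smaller `min_chars` would send text sooner, while a larger one gives the server more context per utterance.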
Both streaming transports start returning audio before the full clip is synthesized. The difference is control: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth HTTP Stream vs WebSocket comparison.
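The guidance on this page can be condensed into a small helper. This is only a sketch of the decision rule: the function name, its parameters, and the returned labels are hypothetical, while the 2500 and 3500 character limits are the ones quoted above. The page does not say how to handle longer non-incremental text, so raising an error in that case is an assumption.

```python
REST_MAX_CHARS = 2500    # REST limit stated on this page
STREAM_MAX_CHARS = 3500  # HTTP Stream limit stated on this page


def pick_transport(text_len: int,
                   incremental_text: bool = False,
                   early_playback: bool = False) -> str:
    """Suggest a transport: 'rest', 'http-stream', or 'websocket'."""
    if incremental_text:
        # Text arrives piece by piece (e.g. from an LLM): keep one connection open.
        return "websocket"
    if early_playback or text_len > REST_MAX_CHARS:
        if text_len > STREAM_MAX_CHARS:
            # Assumption: the caller must split text beyond the single-request limit.
            raise ValueError("text exceeds the single-request limits listed on this page")
        return "http-stream"
    # Short text known up front, complete audio after synthesis is acceptable.
    return "rest"
```

For example, `pick_transport(300)` suggests REST, while the same text with `early_playback=True` suggests HTTP Stream.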