Which Text-to-Speech API to Use

Sarvam gives you three ways to synthesize speech from the same Bulbul voices: the REST, HTTP streaming, and WebSocket APIs. They produce identical audio quality — the difference is how the audio reaches you (one JSON blob, a binary stream, or incremental chunks) and how interactive the connection is. Use this page to pick one before you start integrating.

Comparison

| | REST | HTTP Stream | WebSocket |
|---|---|---|---|
| Endpoint | POST /text-to-speech | POST /text-to-speech/stream | GET /text-to-speech/ws |
| Connection | Single request/response | Single streamed response | Persistent, bidirectional |
| Setup | None | None | Handshake + config message |
| Max text | 2500 characters | 3500 characters | 2500 characters per message (send many) |
| Audio output | Base64 audio inside JSON (decode once) | Raw binary stream (play/save directly) | Base64 chunks (decode each) |
| Time-to-first-audio | After full synthesis | Low (first chunk streams early) | Lowest on a warm connection |
| Utterances | One per request | One per request | Many per connection |
| Connection reuse | New request each time | New request each time | One connection, many conversions |
| Best for | Short, fixed text; simplest integration | Server-side pipelines, proxying, edge/serverless | Interactive voice agents, multi-turn conversations |
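The comparison can be condensed into a small helper. This is only one reading of the limits and "best for" guidance on this page, not an official rule: the function name `pick_tts_api` and its parameters are hypothetical, and routing text longer than 3500 characters to WebSocket assumes you split it into several messages.

```python
# Character limits taken from the comparison on this page.
REST_MAX_CHARS = 2500
STREAM_MAX_CHARS = 3500


def pick_tts_api(text_len: int, multi_turn: bool = False, early_audio: bool = False) -> str:
    """Suggest 'rest', 'http_stream', or 'websocket' for one synthesis job.

    Hypothetical helper: it encodes one interpretation of the guidance,
    not behavior defined by the API itself.
    """
    if text_len < 0:
        raise ValueError("text_len must be non-negative")
    if multi_turn:
        # Many utterances on one connection is the WebSocket use case.
        return "websocket"
    if text_len > STREAM_MAX_CHARS:
        # Too long for a single request; assumes the text is split
        # into several WebSocket messages.
        return "websocket"
    if early_audio or text_len > REST_MAX_CHARS:
        # Audio should start early, or the text exceeds the REST limit.
        return "http_stream"
    return "rest"
```

For example, `pick_tts_api(3000)` suggests `"http_stream"` because the text exceeds the REST limit but still fits in one streaming request.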

When to use each

REST — POST /text-to-speech

  • The text is short (≤2500 characters) and known up front.
  • You’re fine receiving the complete audio after synthesis finishes.
  • The endpoint returns base64 audio inside a JSON response; decode it once, then save or play it.
  • REST API guide →
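The "decode once" step might look like the sketch below. It assumes the JSON body carries a list of base64 strings under a field named `audios`; that field name and the helper `decode_rest_audio` are assumptions for illustration, so check the REST API guide for the actual response schema.

```python
import base64
import json


def decode_rest_audio(response_body: str, field: str = "audios") -> bytes:
    """Decode the first base64 clip found in a JSON response body.

    Assumption: `field` holds a list of base64-encoded audio strings.
    """
    payload = json.loads(response_body)
    clips = payload[field]
    if not clips:
        raise ValueError("response contained no audio")
    # validate=True rejects characters outside the base64 alphabet
    # instead of silently discarding them.
    return base64.b64decode(clips[0], validate=True)


# Stand-in response body; a real one would come from POST /text-to-speech.
fake_body = json.dumps({"audios": [base64.b64encode(b"RIFFdemo").decode("ascii")]})
audio_bytes = decode_rest_audio(fake_body)
```

The returned bytes can then be written to a file opened in binary mode or handed to an audio player.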

HTTP Stream — POST /text-to-speech/stream

  • You want playback or saving to begin before the full clip is ready, but don’t need a persistent connection.
  • Ideal for server-side generation, proxying the stream to a client, or runtimes without WebSocket support (serverless, edge).
  • The endpoint returns a raw binary audio stream that you can pipe straight to a file or player, with no decoding step.
  • It accepts the longest single payload (up to 3500 characters).
  • HTTP Streaming guide →
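Because the response is raw binary, handling it amounts to copying chunks as they arrive. The sketch below shows that copy loop against any iterable of byte chunks; the helper name `save_stream` is hypothetical, and the in-memory buffer stands in for a real file or player.

```python
import io
from typing import BinaryIO, Iterable


def save_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each non-empty chunk to `sink` as it arrives; return bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks (for example, keep-alives).
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a streamed response body.
buffer = io.BytesIO()
written = save_stream([b"\x00\x01", b"", b"\x02"], buffer)
```

With the `requests` library, one source of such chunks would be `response.iter_content(chunk_size=8192)` on a request made with `stream=True`. The sink can be a file opened with `open(path, "wb")`.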

WebSocket — GET /text-to-speech/ws

  • You’re building a conversational agent that streams text incrementally (e.g. token-by-token from an LLM).
  • You need to send many utterances without reconnecting, keeping time-to-first-audio low on successive turns.
  • You want fine-grained control over buffering and flushing.
  • WebSocket guide →
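Two pieces of client-side bookkeeping follow from the WebSocket limits: keeping each text message within 2500 characters, and decoding each base64 audio chunk as it arrives. The sketch below covers only that bookkeeping. The helper names are hypothetical, and it makes no assumption about the message envelope or the config message, which the WebSocket guide defines.

```python
import base64
from typing import Iterable


def split_for_messages(text: str, limit: int = 2500) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring to cut at a space."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    pieces = []
    rest = text.strip()
    while len(rest) > limit:
        # Last space at or before the limit; hard cut if there is none.
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit
        pieces.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        pieces.append(rest)
    return pieces


def decode_chunks(b64_chunks: Iterable[str]) -> bytes:
    """Decode each base64 chunk separately and concatenate the raw bytes.

    Whether the concatenation is directly playable depends on the
    audio format the connection was configured with.
    """
    return b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
```

Cutting at spaces is a simplification; a production client would more likely cut at sentence boundaries so that prosody is not broken mid-phrase.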

Both streaming transports start returning audio before the full clip is synthesized. The difference is control: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth HTTP Stream vs WebSocket comparison.