Which Text-to-Speech API to Use
Sarvam gives you three ways to synthesize speech from the same Bulbul voices: the REST, HTTP streaming, and WebSocket APIs. They produce identical audio quality — the difference is how the audio reaches you (one JSON blob, a binary stream, or incremental chunks) and how interactive the connection is. Use this page to pick one before you start integrating.
REST (POST /text-to-speech): Short text, and you want the full audio file back in one call.
HTTP Stream (POST /text-to-speech/stream): Start playback or saving before the whole clip is synthesized, using one POST and no WebSocket.
WebSocket (GET /text-to-speech/ws): Voice agents streaming text from an LLM, with many utterances on one connection.
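The guidance above can be condensed into a small decision helper. This is only an illustrative sketch: `TtsApi`, `pick_tts_api`, and the three flag names are hypothetical and not part of any Sarvam SDK; only the endpoint paths come from this page.

```python
from enum import Enum


class TtsApi(Enum):
    """The three transports listed on this page (method and path only)."""
    REST = "POST /text-to-speech"
    HTTP_STREAM = "POST /text-to-speech/stream"
    WEBSOCKET = "GET /text-to-speech/ws"


def pick_tts_api(
    *,
    text_arrives_incrementally: bool = False,
    many_utterances: bool = False,
    play_before_complete: bool = False,
) -> TtsApi:
    """Hypothetical helper that mirrors the selection guidance above."""
    # Text streamed from an LLM, or many utterances on one connection:
    # the page points these cases at the WebSocket API.
    if text_arrives_incrementally or many_utterances:
        return TtsApi.WEBSOCKET
    # One request whose audio should start playing or saving early.
    if play_before_complete:
        return TtsApi.HTTP_STREAM
    # Short text where the full audio file comes back in one call.
    return TtsApi.REST
```

For example, `pick_tts_api(play_before_complete=True)` returns `TtsApi.HTTP_STREAM`, while calling it with no flags falls back to `TtsApi.REST`.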
Both streaming transports start returning audio before the full clip is synthesized. The difference is control: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth HTTP Stream vs WebSocket comparison.
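To make the difference in delivery concrete, the sketch below contrasts handling a REST response (one JSON blob) with handling an HTTP Stream response (binary chunks read as they arrive). It uses only the Python standard library. Several details are assumptions that this page does not state and that should be checked against the API reference: the base URL, the `api-subscription-key` header name, and the `audios` field holding base64-encoded clips. The function names are hypothetical.

```python
import base64
import io
import json
import urllib.request
from typing import BinaryIO, Iterator, List

BASE_URL = "https://api.sarvam.ai"      # assumption: not stated on this page
AUTH_HEADER = "api-subscription-key"    # assumption: header name may differ


def build_request(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST for either /text-to-speech or /text-to-speech/stream."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={AUTH_HEADER: api_key, "Content-Type": "application/json"},
        method="POST",
    )


def decode_rest_audio(body: bytes) -> List[bytes]:
    """REST: the whole response is one JSON document.

    Assumes base64-encoded clips under an "audios" key (unverified here).
    """
    doc = json.loads(body)
    return [base64.b64decode(clip) for clip in doc["audios"]]


def iter_stream_audio(resp: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """HTTP Stream: yield raw audio bytes as they are read from the response."""
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Offline demo: an in-memory buffer stands in for the HTTP response body.
fake_body = io.BytesIO(b"abcdefghij")
chunks = list(iter_stream_audio(fake_body, chunk_size=4))
```

Against a live endpoint, the object returned by `urllib.request.urlopen(build_request(...))` could be passed to `iter_stream_audio` in place of the in-memory buffer, so each chunk can be written to a file or player as soon as it is read. The WebSocket transport is left out because its message format is not described on this page.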