Which Text-to-Speech API to Use

Sarvam offers three ways to synthesize speech from the same Bulbul voices: the REST, HTTP Stream, and WebSocket APIs. All three produce identical audio quality. They differ in how the audio reaches you (one JSON blob, a binary stream, or incremental chunks) and in how interactive the connection is. Use this page to pick one before you start integrating.

Quick decision

REST

Short text, and you want the full audio file back in one call.

HTTP Stream

Start playing or saving audio before the whole clip is synthesized, with one POST and no WebSocket.

WebSocket

Voice agents that stream text from an LLM and send many utterances over one connection.

Comparison

| | REST | HTTP Stream | WebSocket |
|---|---|---|---|
| Endpoint | POST /text-to-speech | POST /text-to-speech/stream | GET /text-to-speech/ws |
| Connection | Single request/response | Single streamed response | Persistent, bidirectional |
| Setup | None | None | Handshake + config message |
| Max text | 2500 characters | 3500 characters | 2500 characters per message (send many) |
| Audio output | Base64 audio inside JSON (decode once) | Raw binary stream (play/save directly) | Base64 chunks (decode each) |
| Time-to-first-audio | After full synthesis | Low (first chunk streams early) | Lowest on a warm connection |
| Utterances | One per request | One per request | Many per connection |
| Connection reuse | New request each time | New request each time | One connection, many conversions |
| Best for | Short, fixed text; simplest integration | Server-side pipelines, proxying, edge/serverless | Interactive voice agents, multi-turn conversations |
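The comparison above can be condensed into a small decision helper. The following is a minimal sketch, not an official selection rule: the function name, its flags, and the fallback for text longer than 3500 characters are illustrative assumptions. Only the endpoints and character limits come from the table.

```python
from enum import Enum


class TtsApi(Enum):
    REST = "POST /text-to-speech"
    HTTP_STREAM = "POST /text-to-speech/stream"
    WEBSOCKET = "GET /text-to-speech/ws"


# Limits taken from the comparison table.
REST_MAX_CHARS = 2500
HTTP_STREAM_MAX_CHARS = 3500


def choose_tts_api(
    text_length: int,
    *,
    incremental_text: bool = False,  # text arrives piece by piece, e.g. from an LLM
    many_utterances: bool = False,   # several utterances over one connection
    early_playback: bool = False,    # playback should start before synthesis ends
) -> TtsApi:
    """Hypothetical helper that maps requirements to one of the three APIs."""
    if incremental_text or many_utterances:
        return TtsApi.WEBSOCKET
    if text_length > HTTP_STREAM_MAX_CHARS:
        # Assumption: text too long for any single request is sent as several
        # WebSocket messages. Splitting it across multiple requests would be
        # another option.
        return TtsApi.WEBSOCKET
    if early_playback or text_length > REST_MAX_CHARS:
        return TtsApi.HTTP_STREAM
    return TtsApi.REST


if __name__ == "__main__":
    print(choose_tts_api(800).value)
    print(choose_tts_api(3000).value)
    print(choose_tts_api(800, incremental_text=True).value)
```

Checking the interactive requirements first reflects the table's "Best for" row: a persistent connection matters more than text length once text is produced incrementally.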

When to use each

REST — POST /text-to-speech

  • The text is short (≤2500 characters) and known up front.
  • You’re fine receiving the complete audio after synthesis finishes.
  • Returns base64 audio inside a JSON response — decode it once and save/play.
  • REST API guide →
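Decoding the REST response once might look like the sketch below. It works on an already-received JSON body and makes no network call. The field name `audios` and its shape (a list of base64 strings) are assumptions that this page does not confirm, so check the REST API guide for the actual response schema.

```python
import base64
import json


def decode_tts_audio(response_body: str, field: str = "audios") -> list[bytes]:
    """Decode base64 audio from a JSON response body.

    `field` is an assumed key name; it may hold one base64 string or a
    list of them. Returns one bytes object per entry.
    """
    payload = json.loads(response_body)
    value = payload[field]
    entries = [value] if isinstance(value, str) else list(value)
    return [base64.b64decode(entry) for entry in entries]


if __name__ == "__main__":
    # Synthetic body standing in for a real response.
    fake_audio = b"RIFF-placeholder-bytes"
    body = json.dumps({"audios": [base64.b64encode(fake_audio).decode("ascii")]})

    clips = decode_tts_audio(body)
    with open("output.wav", "wb") as out:  # file name and format are illustrative
        out.write(clips[0])
```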

HTTP Stream — POST /text-to-speech/stream

  • You want playback or saving to begin before the full clip is ready, but don’t need a persistent connection.
  • Ideal for server-side generation, proxying the stream to a client, or runtimes without WebSocket support (serverless, edge).
  • Returns a raw binary audio stream — pipe it straight to a file or player, no decoding.
  • Accepts the longest single payload (up to 3500 characters).
  • HTTP Streaming guide →
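Because the stream is raw binary, saving it amounts to copying chunks as they arrive. The sketch below assumes a file-like response object with a `read(n)` method, such as the one returned by `urllib.request.urlopen`. The helper name is hypothetical, and the request itself (URL, headers, body) is left out because this page does not specify it.

```python
import io
from typing import BinaryIO


def copy_audio_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy a binary stream to `sink` chunk by chunk; return bytes written.

    Each chunk is written as soon as it is read, so a player or file can
    start receiving audio before the stream has ended.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # BytesIO stands in for the streamed HTTP response body.
    fake_response = io.BytesIO(b"\x00" * 10_000)
    with open("stream_output.bin", "wb") as out:  # illustrative file name
        written = copy_audio_stream(fake_response, out)
    print(f"wrote {written} bytes")
```

The same function can proxy the stream to a client by passing the client's writable socket or response object as `sink`.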

WebSocket — GET /text-to-speech/ws

  • You’re building a conversational agent that streams text incrementally (e.g. token-by-token from an LLM).
  • You need to send many utterances without reconnecting, keeping time-to-first-audio low on successive turns.
  • You want fine-grained control over buffering and flushing.
  • WebSocket guide →
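One way to think about buffering and flushing is a client-side buffer that collects LLM tokens and releases a message at each sentence boundary while staying under the 2500-character per-message limit. This is a sketch of that idea only: the sentence-boundary rule and the function name are assumptions, and the WebSocket message format, which the WebSocket guide covers, is not shown.

```python
from typing import Iterable, Iterator

MAX_CHARS_PER_MESSAGE = 2500  # per-message limit from the comparison table
# Assumed flush points; "।" is the danda used in several Indian scripts.
SENTENCE_END = (".", "!", "?", "।")


def buffer_tokens(
    tokens: Iterable[str], max_chars: int = MAX_CHARS_PER_MESSAGE
) -> Iterator[str]:
    """Group streamed tokens into messages of at most `max_chars` characters."""
    buffer = ""
    for token in tokens:
        # Flush first if this token would push the buffer over the limit.
        if buffer and len(buffer) + len(token) > max_chars:
            yield buffer
            buffer = ""
        buffer += token
        # Hard-split the rare token that is itself longer than the limit.
        while len(buffer) > max_chars:
            yield buffer[:max_chars]
            buffer = buffer[max_chars:]
        # Flush at a sentence boundary so synthesis can start early.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer.strip():  # flush whatever remains when the token stream ends
        yield buffer


if __name__ == "__main__":
    llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for message in buffer_tokens(llm_tokens):
        print(repr(message))  # each message would be sent over the open socket
```

Flushing per sentence trades a little latency for more natural prosody than flushing per token. The right granularity depends on the agent.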

Both streaming transports start returning audio before the full clip is synthesized. The difference is control: HTTP Stream is one request in, one stream out; WebSocket keeps the connection open for multiple utterances and finer buffering control. See the in-depth HTTP Stream vs WebSocket comparison.

Related

  • Text-to-Speech overview
  • Supported audio formats & MIME types
  • TTS best practices
  • Credits & Rate Limits