Text-to-Speech Overview | Sarvam API Docs

Sarvam AI's text-to-speech model is Bulbul V3, which provides 30+ voices and natural-sounding speech synthesis for Indian languages.

Bulbul V3

Advanced text-to-speech model with 30+ voices and high-quality natural speech synthesis for Indian languages.

API Types

Two API types are available: a REST API for quick conversions of up to 2,500 characters, and a Streaming API for real-time audio over an HTTP stream or a WebSocket.

REST API

Generate speech for short text with an immediate response. Best for quick conversions of up to 2,500 characters.
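Text longer than the 2,500-character limit has to be split before it is sent. The following is a minimal sketch of one way to do that, assuming sentence boundaries are acceptable split points. The function name `split_text` and the constant `REST_MAX_CHARS` are hypothetical helpers for illustration and are not part of the Sarvam SDK.

```python
import re

REST_MAX_CHARS = 2500  # REST limit stated in this overview


def split_text(text: str, max_chars: int = REST_MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper: it only prepares text locally and makes no API calls.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation, including the Devanagari danda.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request. Passing a smaller `max_chars` produces shorter chunks, at the cost of more requests.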

Streaming API

Stream audio in real time, either via a single HTTP POST for simple pipelines or over a persistent WebSocket connection for interactive voice agents.

Not sure which one fits your latency and interactivity needs? See Which Text-to-Speech API to Use for a side-by-side comparison of REST, HTTP streaming, and WebSocket.

Supported Audio Formats & MIME Types

The TTS API supports eight audio formats. The formats and their MIME type values are listed below:

Format Group	Supported MIME Types
MP3 Variants	`mp3`
WAV Variants	`wav`
AAC Variants	`aac`
OPUS Format	`opus`
FLAC Variants (Lossless)	`flac`
PCM LINEAR16	`pcm`
MULAW (μ-law)	`mulaw`
ALAW (A-law)	`alaw`
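A client can check a requested format against the table above before sending anything. This is a small sketch under that assumption: the eight identifiers are copied from the table, while `SUPPORTED_FORMATS` and `normalize_format` are hypothetical names chosen for illustration.

```python
# Format identifiers copied from the table above.
SUPPORTED_FORMATS = frozenset(
    {"mp3", "wav", "aac", "opus", "flac", "pcm", "mulaw", "alaw"}
)


def normalize_format(name: str) -> str:
    """Return the lowercase format identifier, or raise ValueError.

    Hypothetical helper: accepts input such as " .MP3 " and returns "mp3".
    """
    cleaned = name.strip().lower().lstrip(".")
    if cleaned not in SUPPORTED_FORMATS:
        options = ", ".join(sorted(SUPPORTED_FORMATS))
        raise ValueError(
            f"Unsupported format {name!r}; expected one of: {options}"
        )
    return cleaned
```

Failing locally with a clear message is usually easier to debug than an error returned by the service.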

Experience the voices: Head to dashboard.sarvam.ai to explore 30+ speaker voices, test different languages, and generate audio samples with custom input.

Limits

Limit	Value
REST API: max characters per request	2,500
HTTP streaming: max characters per request	3,500
WebSocket: max characters per message	2,500 (recommended under 500 for lowest latency)
Sample rates	8000 / 16000 / 22050 / 24000 Hz on all surfaces; 32000 / 44100 / 48000 Hz on REST and WebSocket only
Rate limits	Per plan — see Rate Limits
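The character and sample-rate limits in the table can be checked on the client before a request is made. The sketch below encodes only the values listed above. The surface labels `"rest"`, `"http_stream"`, and `"websocket"` and the function `check_request` are hypothetical names for illustration, not API parameters.

```python
# Limits copied from the table above; the dictionary keys are hypothetical labels.
MAX_CHARS = {"rest": 2500, "http_stream": 3500, "websocket": 2500}
COMMON_RATES = {8000, 16000, 22050, 24000}  # available on all surfaces
EXTENDED_RATES = {32000, 44100, 48000}      # REST and WebSocket only


def check_request(surface: str, text: str, sample_rate: int) -> list[str]:
    """Return a list of limit violations; an empty list means none were found."""
    if surface not in MAX_CHARS:
        raise ValueError(f"Unknown surface {surface!r}")
    problems: list[str] = []
    limit = MAX_CHARS[surface]
    if len(text) > limit:
        problems.append(f"text has {len(text)} characters; limit is {limit}")
    # HTTP streaming supports only the common sample rates.
    if surface == "http_stream":
        allowed = COMMON_RATES
    else:
        allowed = COMMON_RATES | EXTENDED_RATES
    if sample_rate not in allowed:
        problems.append(
            f"sample rate {sample_rate} Hz is not available on {surface}"
        )
    return problems
```

For example, `check_request("http_stream", text, 44100)` reports a sample-rate problem, because 44100 Hz is listed for REST and WebSocket only.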

Next Steps

Choose Your API

Select the appropriate API type based on your use case.

Get API Key

Go Live

Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.