HTTP Streaming API
POST /text-to-speech/stream — send text in, get a binary audio stream back. The response starts arriving as soon as the first audio chunk is ready, so you can begin playback or piping without waiting for the full file.
No WebSocket handshake, no config messages, no connection lifecycle. One HTTP request, one streamed response.
Common use cases:
- Backend audio generation — Pipe audio directly to a file, S3, or a downstream service
- API proxying — Forward the stream to your frontend or mobile client as-is
- Batch processing — Generate audio for a queue of texts using simple HTTP calls
- Serverless / edge — Works in any environment that supports HTTP — no WebSocket runtime needed
HTTP Stream vs WebSocket — When to Use Which
Both give you streaming audio. The difference is how much control you need.
Use HTTP Stream when:
- You have a complete text and just need audio back
- You’re generating audio server-side (batch jobs, API endpoints, CI pipelines)
- Your runtime doesn’t support WebSocket (serverless functions, edge workers)
- You want the simplest possible integration — curl works out of the box
Use WebSocket when:
- You’re building a conversational agent that streams text incrementally (e.g., from an LLM)
- You need to send multiple texts without reconnecting
- Low time-to-first-byte on successive utterances matters (connection is already warm)
- You need fine-grained control over buffering and flushing
Code Examples
Piping the Stream
Since the response is a raw binary audio stream, you can pipe it directly without buffering the whole file in memory.
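A minimal sketch of this pattern in Python, using only the standard library. The host URL, Authorization header, and the text/voice_id field names are assumptions for illustration; substitute your actual endpoint and parameters from the API Reference.

```python
import json
import shutil
import urllib.request

# Placeholder base URL -- replace with your real API host.
API_URL = "https://api.example.com/text-to-speech/stream"

def stream_to_file(text: str, path: str, api_key: str, voice_id: str = "default") -> None:
    """POST text and pipe the binary audio response straight to disk.

    copyfileobj reads the response in fixed-size chunks, so audio is
    written as it arrives -- the whole file is never held in memory.
    """
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        shutil.copyfileobj(resp, f, length=8192)
```

The same shape works for piping to S3 or a downstream service: anything that accepts a writable file-like object can take the place of the local file.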
Request Parameters
Response
The response is a binary audio stream — not JSON, not base64. The Content-Type header matches your requested codec (e.g., audio/mpeg for MP3).
You can:
- Save it directly to a file (--output in cURL, f.write(chunk) in Python)
- Pipe it to an audio player
- Forward it to a client as a streaming HTTP response
This is different from the REST endpoint (/text-to-speech), which returns base64-encoded audio inside a JSON response. The stream endpoint returns raw binary audio — no decoding needed.
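Because the body is already raw bytes, forwarding it to your own client is just re-yielding chunks. A framework-agnostic sketch (the empty-chunk filtering is a common defensive touch, not something the API mandates):

```python
from typing import Iterable, Iterator

def relay_chunks(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Re-yield audio chunks from the upstream response as they arrive.

    Pass the result to your web framework's streaming-response type;
    nothing is buffered beyond the chunk in flight.
    """
    for chunk in upstream:
        if chunk:  # skip empty keep-alive chunks
            yield chunk
```

With requests, `upstream` would be `resp.iter_content(chunk_size=8192)`; with urllib, a small generator around `resp.read(8192)`.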
With Pronunciation Dictionary
Pass dict_id to apply custom pronunciations during streaming synthesis:
See the Pronunciation Dictionary guide for setup.
Error Handling
Errors return JSON (not audio) with the standard error format:
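In practice this means you should check the status code and Content-Type before treating the body as audio. A sketch, assuming a JSON error body (its exact shape is in the API Reference):

```python
import json

def parse_stream_response(status: int, content_type: str, first_bytes: bytes) -> bytes:
    """Distinguish an audio stream from a JSON error body.

    On error the whole body is JSON, so decode and surface it;
    on success return the first audio chunk and keep reading the stream.
    """
    if status != 200 or content_type.startswith("application/json"):
        err = json.loads(first_bytes.decode("utf-8"))
        raise RuntimeError(f"TTS request failed: {err}")
    return first_bytes
```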
Full endpoint spec with all parameters and error details is in the API Reference.
Need help? Reach out on Discord.