HTTP Streaming API

POST /text-to-speech/stream — send text in, get a binary audio stream back. The response starts arriving as soon as the first audio chunk is ready, so you can begin playback or piping without waiting for the full file.

No WebSocket handshake, no config messages, no connection lifecycle. One HTTP request, one streamed response.

Common use cases:

  • Backend audio generation — Pipe audio directly to a file, S3, or a downstream service
  • API proxying — Forward the stream to your frontend or mobile client as-is
  • Batch processing — Generate audio for a queue of texts using simple HTTP calls
  • Serverless / edge — Works in any environment that supports HTTP — no WebSocket runtime needed

HTTP Stream vs WebSocket — When to Use Which

Both give you streaming audio. The difference is how much control you need.

|                  | HTTP Stream | WebSocket |
|------------------|-------------|-----------|
| Protocol         | Single POST request | Persistent bidirectional connection |
| Setup            | Zero; a normal HTTP call | Handshake + config message before first text |
| Endpoint         | /text-to-speech/stream | /text-to-speech/ws |
| Text input       | One text payload per request | Send multiple texts on the same connection |
| Max text         | 3500 characters | 2500 characters per message (send many) |
| Audio output     | Binary stream (play/save directly) | Base64-encoded chunks (decode each one) |
| Connection reuse | New connection per request | One connection, many conversions |
| Best for         | One-shot generation, server-side pipelines, simple integrations | Voice agents, interactive apps, multi-turn conversations |

Use HTTP Stream when:

  • You have a complete text and just need audio back
  • You’re generating audio server-side (batch jobs, API endpoints, CI pipelines)
  • Your runtime doesn’t support WebSocket (serverless functions, edge workers)
  • You want the simplest possible integration — curl works out of the box
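Because it's a plain POST, a one-shot call really is just curl. A minimal sketch, assuming the base URL https://api.sarvam.ai and the api-subscription-key header (confirm both in the API Reference):

```shell
# Hypothetical one-shot request; base URL and header name are assumptions
curl -X POST "https://api.sarvam.ai/text-to-speech/stream" \
  -H "api-subscription-key: YOUR_SARVAM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Sarvam AI!", "target_language_code": "en-IN", "speaker": "shubh", "model": "bulbul:v3", "output_audio_codec": "mp3"}' \
  --output output.mp3
```

The --output flag streams the binary response straight to disk as chunks arrive.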

Use WebSocket when:

  • You’re building a conversational agent that streams text incrementally (e.g., from an LLM)
  • You need to send multiple texts without reconnecting
  • Low time-to-first-byte on successive utterances matters (connection is already warm)
  • You need fine-grained control over buffering and flushing

Code Examples

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

chunks = []
for chunk in client.text_to_speech.convert_stream(
    text="नमस्ते! Sarvam AI में आपका स्वागत है। हम India की हर language को voice देते हैं।",
    target_language_code="hi-IN",
    speaker="shubh",
    model="bulbul:v3",
    output_audio_codec="mp3",
):
    chunks.append(chunk)

audio = b"".join(chunks)
with open("output.mp3", "wb") as f:
    f.write(audio)
print(f"Saved output.mp3 ({len(audio)} bytes)")
```

Piping the Stream

Since the response is a raw binary audio stream, you can pipe it directly without buffering the whole file in memory.

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("output.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="भारत की संस्कृति विश्व की सबसे प्राचीन और समृद्ध संस्कृतियों में से एक है।",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
```

Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| text | string | Yes | | Text to convert. Max 3500 characters. Supports code-mixed text. |
| target_language_code | string | No | en-IN | BCP-47 language code (hi-IN, ta-IN, en-IN, etc.) |
| speaker | string | No | shubh | Speaker voice. See voice list. |
| model | string | No | bulbul:v2 | bulbul:v3 (recommended) or bulbul:v2 |
| output_audio_codec | string | No | mp3 | mp3, wav, aac, opus, flac, linear16, mulaw, alaw |
| output_audio_bitrate | string | No | 128k | 32k, 64k, 128k, 192k, 256k |
| pace | number | No | 1.0 | Speech speed. v3: 0.5–2.0, v2: 0.3–3.0 |
| speech_sample_rate | number | No | 22050 | Output sample rate in Hz |
| temperature | number | No | 0.6 | Expressiveness. 0.01–1.0. v3 only. |
| dict_id | string | No | | Pronunciation dictionary ID. v3 only. |
| enable_preprocessing | boolean | No | false | Normalize English words and numbers before synthesis |
| enable_cached_responses | boolean | No | false | Enable response caching (beta) |
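If you call the endpoint over raw HTTP rather than through the SDK, the request body is a JSON object using the parameter names above. A minimal sketch of the payload (values illustrative; the exact accepted schema is in the API Reference):

```python
import json

# Request body sketch built from the parameter table above.
# Values are illustrative, not a definitive payload.
payload = {
    "text": "Hello from Sarvam AI!",
    "target_language_code": "en-IN",
    "speaker": "shubh",
    "model": "bulbul:v3",
    "output_audio_codec": "mp3",
    "pace": 1.0,
}

body = json.dumps(payload)
print(body)
```

Only text is required; every other field falls back to the defaults listed above.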

Response

The response is a binary audio stream — not JSON, not base64. The Content-Type header matches your requested codec (e.g., audio/mpeg for MP3).

You can:

  • Save it directly to a file (--output in cURL, f.write(chunk) in Python)
  • Pipe it to an audio player
  • Forward it to a client as a streaming HTTP response
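The "pipe it to an audio player" case can be sketched with a subprocess: write each chunk to the player's stdin as it arrives, so nothing is buffered in memory. mpg123 is just an example; any command that reads audio from stdin works.

```python
import subprocess

def pipe_to_player(chunks, command):
    """Write an iterable of audio byte chunks to a command's stdin as they arrive."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    try:
        for chunk in chunks:
            proc.stdin.write(chunk)
    finally:
        proc.stdin.close()
        proc.wait()
    return proc.returncode

# With the real stream, something like:
#   pipe_to_player(client.text_to_speech.convert_stream(...), ["mpg123", "-"])
```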

This differs from the REST endpoint (/text-to-speech), which returns base64-encoded audio inside a JSON response. The stream endpoint returns raw binary audio — no decoding needed.
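The difference is easy to see in code: the REST endpoint needs a base64 decode step, while stream chunks are usable bytes as-is. A sketch with stand-in data (the "audios" field name is illustrative; check the REST endpoint reference for the exact response shape):

```python
import base64
import json

# REST endpoint: audio arrives base64-encoded inside JSON.
# (Field name "audios" is illustrative, not confirmed.)
rest_response = json.dumps({"audios": [base64.b64encode(b"RAW_AUDIO_BYTES").decode()]})
decoded = base64.b64decode(json.loads(rest_response)["audios"][0])

# Stream endpoint: chunks are already raw bytes; just concatenate.
stream_chunks = [b"RAW_", b"AUDIO_", b"BYTES"]
raw = b"".join(stream_chunks)

assert decoded == raw  # same audio, but the stream needed no decoding
```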


With Pronunciation Dictionary

Pass dict_id to apply custom pronunciations during streaming synthesis:

```python
from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

with open("output.mp3", "wb") as f:
    for chunk in client.text_to_speech.convert_stream(
        text="NEFT transfer karein aur KYC complete karein",
        target_language_code="hi-IN",
        speaker="shubh",
        model="bulbul:v3",
        dict_id="p_5cb7faa6",
        output_audio_codec="mp3",
    ):
        f.write(chunk)
```

See the Pronunciation Dictionary guide for setup.


Error Handling

Errors return JSON (not audio) with the standard error format:

```json
{
  "error": {
    "message": "Text exceeds maximum length of 3500 characters",
    "code": "unprocessable_entity_error"
  }
}
```
| HTTP Status | Error Code | When |
|-------------|------------|------|
| 400 | invalid_request_error | Missing or malformed parameters |
| 403 | invalid_api_key_error | Invalid or missing API key |
| 422 | unprocessable_entity_error | Text too long, invalid speaker/model |
| 429 | insufficient_quota_error | Rate limit or quota exceeded |
| 500 | internal_server_error | Server error; retry |
```python
from sarvamai import SarvamAI
from sarvamai.core.api_error import ApiError

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

try:
    for chunk in client.text_to_speech.convert_stream(
        text="Hello from Sarvam AI!",
        target_language_code="en-IN",
        speaker="shubh",
        model="bulbul:v3",
        output_audio_codec="mp3",
    ):
        pass  # process chunk
except ApiError as e:
    print(f"Error {e.status_code}: {e.body}")
```
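Since 429 and 500 are the retryable statuses, a small exponential-backoff wrapper around the whole conversion is a reasonable pattern. A sketch with a stand-in exception (the real code would catch the SDK's ApiError; delays are illustrative):

```python
import time

RETRYABLE = {429, 500}

def with_retries(make_request, max_attempts=3, base_delay=1.0):
    """Retry a callable on retryable HTTP status codes with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except RuntimeError as e:  # stand-in for the SDK's ApiError
            status = getattr(e, "status_code", None)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Note the whole stream is retried from the start: partial audio from a failed attempt should be discarded, not appended to.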

Full endpoint spec with all parameters and error details is in the API Reference.

Need help? Reach out on Discord.