REST Stream

Converts the input text into a streamed spoken audio response, encoded with the specified output codec (e.g., `audio/mpeg` for MP3). The response is returned as a binary audio stream that the client can play or save directly. The optional `dict_id` parameter applies a [pronunciation dictionary](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create) during synthesis.

Authentication

api-subscription-key (string)

API key authentication via request header.

Request

This endpoint expects an object.
text (string, required, 1-3500 characters)

The text to be converted into streamed speech.

**Features:**
- Max 3500 characters
- Supports code-mixed text (English and Indic languages)

**Important Note:**
- For numbers larger than 4 digits, use commas (e.g., '10,000' instead of '10000')
- This ensures proper pronunciation as a whole number
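The comma rule and the length limit above can be applied on the client before sending a request. The following is a minimal sketch, not part of the API: the helper names `add_thousands_separators` and `prepare_text` are hypothetical, and three-digit grouping (as in '10,000') is an assumption, since this section shows only that one example.

```python
import re

MAX_TEXT_CHARS = 3500  # upper limit stated for the `text` field

# Runs of 5+ digits that are not preceded by a digit, comma, or period
# and not followed by another digit.
_LONG_NUMBER = re.compile(r"(?<![\d.,])\d{5,}(?!\d)")


def add_thousands_separators(text: str) -> str:
    """Insert commas into plain integers of five or more digits."""

    def _group(match: re.Match) -> str:
        digits = match.group()
        if digits.startswith("0"):
            # Leading zeros suggest an ID or phone number; leave it untouched.
            return digits
        return f"{int(digits):,}"

    return _LONG_NUMBER.sub(_group, text)


def prepare_text(text: str) -> str:
    """Group long numbers, then enforce the 1-3500 character limit."""
    prepared = add_thousands_separators(text)
    if not 1 <= len(prepared) <= MAX_TEXT_CHARS:
        raise ValueError(
            f"text must be 1-{MAX_TEXT_CHARS} characters, got {len(prepared)}"
        )
    return prepared
```

The length check runs after grouping because the inserted commas count toward the character limit.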
target_language_code (enum, optional)

The language code in BCP-47 format.

speaker (enum or null, optional)

The speaker voice to be used for the output audio.

Default: Shubh (for bulbul:v3), Anushka (for bulbul:v2)

Note: Speaker selection must match the chosen model version.

pitch (double or null, optional, -1 to 1, defaults to 0)

Controls the pitch of the audio. Range: -0.75 to 0.75. Default is 0.0.

Note: Only supported for bulbul:v2.

pace (double or null, optional, 0.3 to 3, defaults to 1)

Controls the speed of the audio. Default is 1.0.

**Model-specific ranges:**
- **bulbul:v3:** 0.5 to 2.0
- **bulbul:v2:** 0.3 to 3.0
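Because the valid `pace` range depends on the model, a client may want to normalize the value before sending it. The sketch below clamps a requested pace into the range listed above for each model; the `clamp_pace` helper is hypothetical, and clamping rather than rejecting out-of-range values is a design choice made here, not documented API behavior.

```python
# Pace ranges per model, as listed in this section.
PACE_RANGES = {
    "bulbul:v3": (0.5, 2.0),
    "bulbul:v2": (0.3, 3.0),
}


def clamp_pace(model: str, pace: float) -> float:
    """Return `pace` limited to the documented range for `model`."""
    if model not in PACE_RANGES:
        raise ValueError(f"no pace range known for model {model!r}")
    low, high = PACE_RANGES[model]
    return min(max(pace, low), high)
```

For example, `clamp_pace("bulbul:v3", 2.5)` returns `2.0`, while the same value passes through unchanged for `bulbul:v2`.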
loudness (double or null, optional, 0.1 to 3, defaults to 1)

Controls the loudness of the audio. Range: 0.3 to 3.0. Default is 1.0.

Note: Only supported for bulbul:v2.

speech_sample_rate (enum or null, optional)

Specifies the sample rate of the output audio. Default is 22050 Hz.

Note: OPUS codec only supports 8000, 12000, 16000, 24000, 48000 Hz.
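The default of 22050 Hz is not among the rates listed for OPUS, so a client using that codec may want to set the sample rate explicitly. The sketch below illustrates one way to do so. The codec string `"opus"`, the `resolve_sample_rate` helper, and the 24000 Hz fallback are assumptions for illustration; how the server behaves when the field is omitted with OPUS is not stated in this section.

```python
from typing import Optional

# Sample rates listed for the OPUS codec in this section.
OPUS_SAMPLE_RATES = frozenset({8000, 12000, 16000, 24000, 48000})


def resolve_sample_rate(
    codec: str,
    requested: Optional[int] = None,
    opus_fallback: int = 24000,  # assumption: any listed rate would do
) -> Optional[int]:
    """Pick a `speech_sample_rate` value, or None to omit the field."""
    if codec.lower() != "opus":
        # Other codecs: pass the caller's choice through unchanged.
        return requested
    if requested is None:
        return opus_fallback
    if requested not in OPUS_SAMPLE_RATES:
        allowed = ", ".join(str(r) for r in sorted(OPUS_SAMPLE_RATES))
        raise ValueError(f"OPUS supports only {allowed} Hz, got {requested}")
    return requested
```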

enable_preprocessing (boolean, optional, defaults to false)

Controls whether English words and numeric entities are normalized. Default is false.
model (enum, optional)

Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.

Allowed values:
temperature (double or null, optional, 0.01 to 1, defaults to 0.6)

Controls the randomness of the output. Range: 0.01 to 1.0. Default is 0.6.

Note: Only supported for bulbul:v3.

enable_cached_responses (boolean, optional, defaults to false)

Enables caching for the request. Default is false. Currently in beta.
dict_id (string or null, optional)

The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text are replaced with their custom pronunciations before speech is generated. Create and manage dictionaries via the [Pronunciation Dictionary API](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create). Only supported by **bulbul:v3**.
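Several parameters in this section are tied to one model version: `pitch` and `loudness` to bulbul:v2, `temperature` and `dict_id` to bulbul:v3. A client-side check along these lines can catch mismatches before a request is sent. This is only a sketch: `unsupported_fields` is a hypothetical helper, and what the server does with a mismatched field is not described here.

```python
# Fields documented as supported by only one model version.
MODEL_ONLY_FIELDS = {
    "bulbul:v2": {"pitch", "loudness"},
    "bulbul:v3": {"temperature", "dict_id"},
}

DEFAULT_MODEL = "bulbul:v2"  # default stated for the `model` field


def unsupported_fields(payload: dict) -> list:
    """List fields set in `payload` that belong to a different model."""
    model = payload.get("model") or DEFAULT_MODEL
    problems = []
    for owner, fields in MODEL_ONLY_FIELDS.items():
        if owner == model:
            continue
        for name in fields:
            # A field set to None is treated the same as an omitted field.
            if payload.get(name) is not None:
                problems.append(name)
    return sorted(problems)
```

With this sketch, a bulbul:v3 payload that sets both `pitch` and `temperature` yields `["pitch"]`, and a payload that omits `model` but sets `dict_id` yields `["dict_id"]`.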
output_audio_codec (enum, optional)

Specifies the codec for the streamed output audio (e.g., 'mp3').

output_audio_bitrate (enum, optional)

Bitrate for the streamed output audio. Default is '128k'.
Allowed values:

Response

Success. Returns a streamed audio response in the requested format (e.g., audio/mpeg for MP3, audio/wav for WAV).
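Since the response is a binary stream, a client can write it to disk chunk by chunk instead of buffering the whole file. The sketch below uses only the Python standard library. Several details are assumptions because this section does not state them: the endpoint URL (a placeholder here), the POST method, and the JSON request body. The helper names `build_request` and `save_stream` are hypothetical.

```python
import json
import urllib.request

# Assumption: placeholder URL; this section does not give the endpoint address.
STREAM_URL = "https://example.invalid/text-to-speech/stream"


def build_request(api_key: str, payload: dict, url: str = STREAM_URL) -> urllib.request.Request:
    """Build a request with the API key header and a JSON body (assumed)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",  # assumption: the method is not stated in this section
        headers={
            "api-subscription-key": api_key,
            "Content-Type": "application/json",
        },
    )


def save_stream(response, out_file, chunk_size: int = 8192) -> int:
    """Copy a streamed response to `out_file`; return the bytes written."""
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        out_file.write(chunk)
        total += len(chunk)
    return total


# Usage (not executed here; requires the real URL and a valid key):
# request = build_request("YOUR_API_KEY", {"text": "Hello", "output_audio_codec": "mp3"})
# with urllib.request.urlopen(request) as response, open("speech.mp3", "wb") as audio:
#     save_stream(response, audio)
```

Reading in fixed-size chunks keeps memory use flat regardless of audio length, and the same loop could feed an audio player instead of a file.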

Errors