REST Stream

Converts the input text into a streamed spoken audio response. This endpoint supports streaming audio using the specified output codec (e.g., `audio/mpeg` for MP3). The response is returned as a binary audio stream, which can be played or saved directly by the client. Supports the `dict_id` parameter to apply a [pronunciation dictionary](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create) during synthesis.

Authentication

api-subscription-key string
API Key authentication via header
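
As a minimal sketch of header-based authentication, the helper below places the key in the `api-subscription-key` header. The `Content-Type: application/json` header and the `SARVAM_API_KEY` environment variable name are assumptions made for illustration; they are not stated in this reference.

```python
import os


def build_headers(api_key: str) -> dict[str, str]:
    """Return request headers that carry the API key."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        "api-subscription-key": api_key,
        # Assumption: the request object is sent as JSON.
        "Content-Type": "application/json",
    }


# Hypothetical environment variable name; falls back to a placeholder value.
headers = build_headers(os.environ.get("SARVAM_API_KEY", "demo-key"))
```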

Request

This endpoint expects an object.
text string Required 1-3500 characters

The text to be converted into streamed speech.

Features:

  • Max 3500 characters
  • Supports code-mixed text (English and Indic languages)

Important Note:

  • For numbers with more than 4 digits, use commas (e.g., ‘10,000’ instead of ‘10000’)
  • The commas ensure the value is pronounced as a whole number rather than digit by digit
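
The length limit and the comma rule above could be applied client-side before sending a request. The following is a rough sketch, not part of the API: the helper names are hypothetical, and it uses three-digit grouping to match the ‘10,000’ example, since this reference does not say whether another grouping style is preferred. Digit runs that start with `0` or follow a decimal point are left untouched.

```python
import re

MAX_TEXT_CHARS = 3500  # limit stated for the `text` field

# Standalone runs of 5+ ASCII digits that do not start with 0 and do not
# follow a letter, digit, or decimal point.
_LONG_NUMBER = re.compile(r"(?<![\w.])[1-9]\d{4,}(?!\w)", re.ASCII)


def add_thousands_separators(text: str) -> str:
    """Insert commas into long digit runs, e.g. 10000 -> 10,000."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group()):,}", text)


def prepare_text(text: str) -> str:
    """Apply the comma rule, then enforce the 1-3500 character limit."""
    prepared = add_thousands_separators(text)
    if not 1 <= len(prepared) <= MAX_TEXT_CHARS:
        raise ValueError(f"text must be 1-{MAX_TEXT_CHARS} characters")
    return prepared
```

Because adding commas lengthens the text, the length check runs after the substitution.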
target_language_code enum Optional

The language code in BCP-47 format.

speaker enum or null Optional

The speaker voice to be used for the output audio.

Default: shubh (for bulbul:v3), anushka (for bulbul:v2)

Note: Speaker selection must match the chosen model version.

Important: Speaker names are case-sensitive and must be lowercase (e.g., ritu not Ritu).

pitch double or null Optional -1 to 1

Controls the pitch of the audio. Range: -0.75 to 0.75. Default is 0.0.

Note: Only supported for bulbul:v2.

pace double or null Optional 0.3-3 Defaults to 1

Controls the speed of the audio. Default is 1.0.

Model-specific ranges:

  • bulbul:v3: 0.5 to 2.0
  • bulbul:v2: 0.3 to 3.0
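
The model-specific ranges above can be checked before a request is sent. This is a small illustrative sketch; `validate_pace` is a hypothetical helper, not something the API provides.

```python
# Pace limits per model, taken from the ranges listed above.
PACE_RANGES = {
    "bulbul:v3": (0.5, 2.0),
    "bulbul:v2": (0.3, 3.0),
}


def validate_pace(pace: float, model: str) -> float:
    """Return `pace` unchanged if it is within the range for `model`."""
    if model not in PACE_RANGES:
        raise ValueError(f"unknown model: {model}")
    low, high = PACE_RANGES[model]
    if not low <= pace <= high:
        raise ValueError(f"pace for {model} must be between {low} and {high}")
    return pace
```

For example, a pace of 2.5 passes for `bulbul:v2` but is rejected for `bulbul:v3`.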
loudness double or null Optional 0.1-3

Controls the loudness of the audio. Range: 0.3 to 3.0. Default is 1.0.

Note: Only supported for bulbul:v2.

speech_sample_rate enum or null Optional

Specifies the sample rate of the output audio. Default is 22050 Hz.

Note: OPUS codec only supports 8000, 12000, 16000, 24000, 48000 Hz.
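
One consequence of the note above is that the default rate of 22050 Hz is not in the OPUS list, so an OPUS request would need an explicit sample rate. The sketch below illustrates such a check; the codec string `"opus"` is an assumed value, since this reference only gives ‘mp3’ as an example, and `check_sample_rate` is a hypothetical helper.

```python
# Sample rates listed for the OPUS codec in the note above.
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}
DEFAULT_SAMPLE_RATE = 22050  # documented default


def check_sample_rate(codec: str, sample_rate: int = DEFAULT_SAMPLE_RATE) -> int:
    """Reject sample rates that the OPUS codec does not support."""
    # Assumption: the codec enum value for OPUS is the string "opus".
    if codec.lower() == "opus" and sample_rate not in OPUS_SAMPLE_RATES:
        allowed = sorted(OPUS_SAMPLE_RATES)
        raise ValueError(f"OPUS requires one of {allowed} Hz, got {sample_rate}")
    return sample_rate
```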

enable_preprocessing boolean Optional Defaults to false
Controls whether normalization of English words and numeric entities is performed. Default is false.
model enum Optional

Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.

Allowed values:
temperature double or null Optional 0.01-1 Defaults to 0.6

Controls the randomness of the output. Range: 0.01 to 1.0. Default is 0.6.

Note: Only supported for bulbul:v3.

enable_cached_responses boolean Optional Defaults to false
Enable caching for the request. Default is false. Currently in beta.
dict_id string or null Optional

The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.

Create and manage dictionaries via the Pronunciation Dictionary API. Only supported by bulbul:v3.
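
Several parameters in this reference are tied to one model version: `pitch` and `loudness` to bulbul:v2, `temperature` and `dict_id` to bulbul:v3. A client could flag mismatches before sending, along the lines of the sketch below. `unsupported_fields` is a hypothetical helper, and how the server itself reacts to a mismatched field is not described here.

```python
V2_ONLY = {"pitch", "loudness"}       # noted as bulbul:v2 only
V3_ONLY = {"temperature", "dict_id"}  # noted as bulbul:v3 only
DEFAULT_MODEL = "bulbul:v2"           # documented default for `model`


def unsupported_fields(payload: dict) -> list[str]:
    """List fields set in `payload` that the chosen model does not support."""
    model = payload.get("model", DEFAULT_MODEL)
    if model == "bulbul:v2":
        blocked = V3_ONLY
    elif model == "bulbul:v3":
        blocked = V2_ONLY
    else:
        blocked = set()  # unknown model: nothing to check in this sketch
    # A field explicitly set to None (null) is treated as not provided.
    return sorted(name for name in blocked if payload.get(name) is not None)
```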

output_audio_codec enum Optional

Specifies the codec for the streamed output audio (e.g., ‘mp3’).

output_audio_bitrate enum Optional
Bitrate for the streamed output audio. Default is '128k'.
Allowed values:

Response

Success. Returns a streamed audio response in the requested format (e.g., audio/mpeg for MP3, audio/wav for WAV).
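
Because the response is a binary stream, a client can write it to disk chunk by chunk instead of buffering the whole file. The sketch below uses only the Python standard library. The URL in `STREAM_URL`, the `POST` method, and the JSON request encoding are assumptions that do not appear in this reference and should be checked against the official documentation; `save_stream` and `synthesize_to_file` are hypothetical helper names.

```python
import json
import urllib.request
from typing import BinaryIO

# Assumption: endpoint URL is not given in this reference.
STREAM_URL = "https://api.sarvam.ai/text-to-speech/stream"


def save_stream(source: BinaryIO, destination: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy a binary stream in chunks and return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total


def synthesize_to_file(payload: dict, api_key: str, path: str) -> int:
    """Send the request and stream the audio response into `path`."""
    request = urllib.request.Request(
        STREAM_URL,
        data=json.dumps(payload).encode("utf-8"),  # assumed JSON body
        headers={
            "api-subscription-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",  # assumed HTTP method
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        return save_stream(response, out)
```

A call might look like `synthesize_to_file({"text": "Hello", "output_audio_codec": "mp3"}, api_key, "out.mp3")`. Note that `urlopen` raises `urllib.error.HTTPError` for non-success status codes.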

Errors

400 Bad Request Error
403 Forbidden Error
422 Unprocessable Entity Error
429 Too Many Requests Error
500 Internal Server Error
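
A client might map these status codes to their names and decide which ones are worth retrying. The sketch below is illustrative only: treating 429 and 500 as retryable is a common HTTP convention assumed here, not a policy stated in this reference, and both helper names are hypothetical.

```python
# Status codes and names as listed in the Errors section.
ERROR_NAMES = {
    400: "Bad Request Error",
    403: "Forbidden Error",
    422: "Unprocessable Entity Error",
    429: "Too Many Requests Error",
    500: "Internal Server Error",
}

# Assumption: rate-limit and server errors may succeed on a later attempt.
RETRYABLE_STATUSES = {429, 500}


def describe_error(status: int) -> str:
    """Return the documented name for a status code, or a generic label."""
    return ERROR_NAMES.get(status, f"Unexpected status {status}")


def is_retryable(status: int) -> bool:
    """Return True if a later retry could plausibly succeed."""
    return status in RETRYABLE_STATUSES
```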