Converts the input text into a streamed spoken audio response.
The audio is streamed in the requested output codec and returned as a binary stream with the matching content type (e.g., `audio/mpeg` for MP3), which the client can play or save directly.
The endpoint also accepts a `dict_id` parameter that applies a [pronunciation dictionary](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create) during synthesis.
Authentication
`api-subscription-key` string
API Key authentication via header
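As a minimal sketch of header-based authentication, the snippet below builds (but does not send) a JSON POST request that carries the key in the `api-subscription-key` header. The endpoint URL, the `build_tts_request` helper, and the JSON content type are assumptions made for illustration; the real URL is not given in this section.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the real endpoint URL.
TTS_STREAM_URL = "https://example.invalid/text-to-speech/stream"


def build_tts_request(api_key: str, payload: dict,
                      url: str = TTS_STREAM_URL) -> urllib.request.Request:
    """Build a POST request authenticated via the api-subscription-key header."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "api-subscription-key": api_key,     # header name from this section
            "Content-Type": "application/json",  # assumed: the body is a JSON object
        },
    )


request = build_tts_request("YOUR_API_KEY", {"text": "Hello"})
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before any network call is made.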
Request
This endpoint expects an object.
`text` string, Required, 1-3500 characters
The text to be converted into streamed speech.
**Features:**
- Max 3500 characters
- Supports code-mixed text (English and Indic languages)
**Important Note:**
- Write numbers with more than four digits using commas (e.g., '10,000' instead of '10000')
- This ensures the number is pronounced as a whole number
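The comma rule and the length limit can be applied client-side before sending. The sketch below is one possible approach, not part of the API: the helper names are hypothetical, it uses Western three-digit grouping (matching the '10,000' example), and it leaves digit runs with a leading zero untouched on the assumption that they are codes rather than quantities.

```python
import re

MAX_TEXT_LENGTH = 3500  # limit stated for the `text` parameter

# Standalone runs of five or more digits; digits that follow a decimal point
# or an existing comma group are left alone.
_LONG_NUMBER = re.compile(r"(?<![\d.,])\d{5,}(?!\d)")


def add_thousands_separators(text: str) -> str:
    """Insert commas into long integers, e.g. '10000' becomes '10,000'."""
    def _format(match: re.Match) -> str:
        digits = match.group()
        if digits.startswith("0"):
            return digits  # assumed to be an identifier, not a quantity
        return f"{int(digits):,}"
    return _LONG_NUMBER.sub(_format, text)


def prepare_text(text: str) -> str:
    """Apply comma formatting, then enforce the 1-3500 character limit."""
    prepared = add_thousands_separators(text)
    if not 1 <= len(prepared) <= MAX_TEXT_LENGTH:
        raise ValueError(
            f"text must be 1-{MAX_TEXT_LENGTH} characters, got {len(prepared)}"
        )
    return prepared
```

The length check runs after formatting because the inserted commas add characters.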
`target_language_code` enum, Optional
The language code in BCP-47 format.
`speaker` enum or null, Optional
The speaker voice to be used for the output audio.
**Default:** shubh (for bulbul:v3), anushka (for bulbul:v2)
**Note:** Speaker selection must match the chosen model version.
**Important:** Speaker names are case-sensitive and must be lowercase (e.g., `ritu` not `Ritu`).
`pitch` double or null, Optional, -1 to 1, defaults to 0
Controls the pitch of the audio. Range: -0.75 to 0.75. Default is 0.0.
Note: Only supported for bulbul:v2.
`pace` double or null, Optional, 0.3 to 3, defaults to 1
Controls the speed of the audio. Default is 1.0.
**Model-specific ranges:**
- **bulbul:v3:** 0.5 to 2.0
- **bulbul:v2:** 0.3 to 3.0
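Because the allowed pace depends on the model, a client may want to check the value before sending. The following is a minimal sketch using the ranges listed above; `validate_pace` is a hypothetical helper, and the default model follows the documented default of bulbul:v2.

```python
# Pace limits per model, as listed in this section.
PACE_RANGES = {
    "bulbul:v3": (0.5, 2.0),
    "bulbul:v2": (0.3, 3.0),
}


def validate_pace(pace: float, model: str = "bulbul:v2") -> float:
    """Return pace unchanged if it is within the model's range, else raise."""
    if model not in PACE_RANGES:
        raise ValueError(f"unknown model: {model!r}")
    low, high = PACE_RANGES[model]
    if not low <= pace <= high:
        raise ValueError(f"pace for {model} must be between {low} and {high}, got {pace}")
    return pace
```

For example, a pace of 2.5 passes for bulbul:v2 but raises for bulbul:v3.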
`loudness` double or null, Optional, 0.1 to 3, defaults to 1
Controls the loudness of the audio. Range: 0.3 to 3.0. Default is 1.0.
Note: Only supported for bulbul:v2.
`speech_sample_rate` enum or null, Optional
Specifies the sample rate of the output audio. Default is 22050 Hz.
Note: OPUS codec only supports 8000, 12000, 16000, 24000, 48000 Hz.
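One consequence of the note above is that the default rate of 22050 Hz is not in the OPUS list, so an OPUS request needs an explicit sample rate. A minimal sketch of that check follows; the codec string `"opus"`, the integer representation of the rate, and the `check_sample_rate` helper are assumptions for illustration.

```python
# Sample rates the OPUS codec supports, as listed in this section.
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}
DEFAULT_SAMPLE_RATE = 22050  # documented default, in Hz


def check_sample_rate(codec: str, sample_rate: int = DEFAULT_SAMPLE_RATE) -> int:
    """Raise if the codec is OPUS and the rate is not one OPUS supports."""
    if codec.lower() == "opus" and sample_rate not in OPUS_SAMPLE_RATES:
        allowed = ", ".join(str(rate) for rate in sorted(OPUS_SAMPLE_RATES))
        raise ValueError(f"OPUS supports only {allowed} Hz, got {sample_rate}")
    return sample_rate
```

Other codecs are passed through unchecked here, since this section does not list their supported rates.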
`enable_preprocessing` boolean, Optional, defaults to false
Controls whether normalization of English words and numeric entities is performed. Default is false.
`model` enum, Optional
Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.
`temperature` double or null, Optional, 0.01 to 1, defaults to 0.6
Controls the randomness of the output. Range: 0.01 to 1.0. Default is 0.6.
Note: Only supported for bulbul:v3.
`enable_cached_responses` boolean, Optional, defaults to false
Enable caching for the request. Default is false. Currently in beta.
`dict_id` string or null, Optional
The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.
Create and manage dictionaries via the [Pronunciation Dictionary API](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create). Only supported by **bulbul:v3**.
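Several parameters are tied to one model version: `pitch` and `loudness` to bulbul:v2, `temperature` and `dict_id` to bulbul:v3. The sketch below flags mismatches in a request payload before it is sent. The `unsupported_params` helper is hypothetical, and this section does not say whether the server ignores or rejects such parameters, so the function only reports them.

```python
# Parameter name -> the only model this section says supports it.
MODEL_ONLY_PARAMS = {
    "pitch": "bulbul:v2",
    "loudness": "bulbul:v2",
    "temperature": "bulbul:v3",
    "dict_id": "bulbul:v3",
}


def unsupported_params(payload: dict) -> list:
    """List parameters set in the payload that the chosen model does not support."""
    model = payload.get("model", "bulbul:v2")  # documented default model
    return sorted(
        name
        for name, required_model in MODEL_ONLY_PARAMS.items()
        if payload.get(name) is not None and model != required_model
    )
```

A caller could log the returned names, or drop them from the payload, depending on how strict the client should be.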
`output_audio_codec` enum, Optional
Specifies the codec for the streamed output audio (e.g., `mp3`).
`output_audio_bitrate` enum, Optional
Bitrate for the streamed output audio. Default is '128k'.
Response
Success. Returns a streamed audio response in the requested format (e.g., audio/mpeg for MP3, audio/wav for WAV).
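Since the response is a binary stream, a client can write it out chunk by chunk instead of buffering the whole file. The following is a minimal sketch with a hypothetical `write_audio_stream` helper; it is demonstrated on in-memory placeholder bytes rather than a live response.

```python
import io
from typing import BinaryIO, Iterable


def write_audio_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each non-empty chunk to sink and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty keep-alive style chunks
            sink.write(chunk)
            total += len(chunk)
    return total


# Demonstration with placeholder bytes standing in for streamed audio.
fake_chunks = [b"ID3", b"", b"\x00\x01"]
buffer = io.BytesIO()
written = write_audio_stream(fake_chunks, buffer)
```

With a real HTTP response object from `urllib.request.urlopen`, the chunks could be produced by `iter(lambda: response.read(4096), b"")`, and the sink could be a file opened in `"wb"` mode.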