Converts the input text into a streamed spoken audio response.
This endpoint streams audio encoded with the specified output codec (e.g., MP3, served as audio/mpeg). The response is a binary audio stream that the client can play or save directly.
Supports the dict_id parameter to apply a pronunciation dictionary during synthesis.
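Because the response is a binary stream, a client can write chunks to disk as they arrive instead of buffering the whole clip. The helper below is a minimal sketch of that pattern. The name save_audio_stream is hypothetical, and the in-memory chunk list stands in for a real HTTP response body, since the endpoint URL and authentication scheme are not given in this section.

```python
import io
from typing import BinaryIO, Iterable


def save_audio_stream(chunks: Iterable[bytes], dest: BinaryIO) -> int:
    """Write audio chunks to dest in arrival order and return the byte count."""
    total = 0
    for chunk in chunks:
        if not chunk:
            # Some HTTP clients yield empty chunks; skip them.
            continue
        dest.write(chunk)
        total += len(chunk)
    return total


# Demo: an in-memory list of chunks stands in for the streamed response body.
buffer = io.BytesIO()
written = save_audio_stream([b"abc", b"", b"de"], buffer)
```

With the requests library, the chunks argument could be the iterator returned by Response.iter_content on a request made with stream=True, and dest a file opened in binary write mode.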
The text to be converted into streamed speech.
Features:
Important Note:
The language code in BCP-47 format.
The speaker voice to be used for the output audio.
Default: shubh (for bulbul:v3), anushka (for bulbul:v2)
Note: Speaker selection must match the chosen model version.
Important: Speaker names are case-sensitive and must be lowercase (e.g., ritu not Ritu).
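The default and case rules above can be checked on the client before a request is sent. This is a sketch only: resolve_speaker is a hypothetical helper, and it does not verify that a speaker belongs to the chosen model, because the per-model speaker lists are not given in this section.

```python
from typing import Optional

# Defaults as documented above.
DEFAULT_SPEAKERS = {"bulbul:v3": "shubh", "bulbul:v2": "anushka"}


def resolve_speaker(model: str, speaker: Optional[str] = None) -> str:
    """Return the speaker to send, applying the documented default and case rule."""
    if model not in DEFAULT_SPEAKERS:
        raise ValueError(f"unknown model: {model!r}")
    if speaker is None:
        return DEFAULT_SPEAKERS[model]
    if speaker != speaker.lower():
        # Speaker names are case-sensitive and must be lowercase.
        raise ValueError(f"speaker must be lowercase: {speaker!r}")
    return speaker
```

Rejecting a mixed-case name rather than silently lowercasing it is a design choice here; it surfaces the mistake at the call site.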
Controls the pitch of the audio. Range: -0.75 to 0.75. Default is 0.0.
Note: Only supported for bulbul:v2.
Controls the speed of the audio. Default is 1.0.
Model-specific ranges:
Controls the loudness of the audio. Range: 0.3 to 3.0. Default is 1.0.
Note: Only supported for bulbul:v2.
Specifies the sample rate of the output audio. Default is 22050 Hz.
Note: The OPUS codec supports only 8000, 12000, 16000, 24000, and 48000 Hz.
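The OPUS restriction can be sketched as a small client-side check. The function name check_sample_rate and the codec identifier "opus" are assumptions made for illustration. Note that the documented default of 22050 Hz is not in the OPUS list, so this check requires an explicit rate whenever OPUS is requested; how the service itself handles that combination is not stated here.

```python
# Rates the OPUS codec supports, as listed above.
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}
DEFAULT_SAMPLE_RATE = 22050


def check_sample_rate(codec: str, sample_rate: int = DEFAULT_SAMPLE_RATE) -> int:
    """Return sample_rate if it is valid for codec, otherwise raise ValueError."""
    # "opus" as the codec identifier is an assumption, not taken from this section.
    if codec.lower() == "opus" and sample_rate not in OPUS_SAMPLE_RATES:
        allowed = ", ".join(str(r) for r in sorted(OPUS_SAMPLE_RATES))
        raise ValueError(
            f"OPUS does not support {sample_rate} Hz; use one of: {allowed}"
        )
    return sample_rate
```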
Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.
Controls the randomness of the output. Range: 0.01 to 1.0. Default is 0.6.
Note: Only supported for bulbul:v3.
The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.
Create and manage dictionaries via the Pronunciation Dictionary API. Only supported by bulbul:v3.
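Several parameters above apply to only one model version. The sketch below collects those rules into one check that reports every problem at once. Only dict_id is named in this section; the keys pitch, loudness, and temperature are assumed names for the parameters described above, and the ranges are treated as inclusive, which is also an assumption.

```python
from typing import Any, Dict, List

# Documented ranges; key names other than dict_id are assumptions.
RANGES = {
    "pitch": (-0.75, 0.75),
    "loudness": (0.3, 3.0),
    "temperature": (0.01, 1.0),
}

# The model version each parameter is documented as supporting.
MODEL_ONLY = {
    "pitch": "bulbul:v2",
    "loudness": "bulbul:v2",
    "temperature": "bulbul:v3",
    "dict_id": "bulbul:v3",
}


def check_model_params(model: str, params: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no rule was violated."""
    problems = []
    for name, value in params.items():
        required = MODEL_ONLY.get(name)
        if required is not None and model != required:
            problems.append(f"{name} is only supported for {required}")
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} is outside {low} to {high}")
    return problems
```

For example, passing pitch with bulbul:v3 yields one message naming bulbul:v2 as the supporting model, while an in-range pitch and loudness with bulbul:v2 yields an empty list.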
Specifies the codec for the streamed output audio (e.g., 'mp3').
Success. Returns a streamed audio response in the requested format (e.g., audio/mpeg for MP3, audio/wav for WAV).