Convert text into spoken audio. The output is a base64-encoded audio string that must be decoded before use.
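As a minimal sketch of the decoding step, the Python helpers below turn the base64 string into raw audio bytes and optionally write them to disk. The function names and the output path are illustrative assumptions. How the string is extracted from the response body is not covered here, because the response field name is not given in this reference.

```python
import base64
from pathlib import Path


def decode_audio(audio_b64: str) -> bytes:
    """Decode a base64-encoded audio string into raw audio bytes."""
    return base64.b64decode(audio_b64)


def save_audio(audio_b64: str, out_path: str) -> int:
    """Decode the audio, write it to out_path, and return the byte count.

    out_path is a caller-chosen file name (hypothetical, e.g. "speech.wav").
    """
    audio_bytes = decode_audio(audio_b64)
    Path(out_path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

The decoded bytes can then be passed to an audio player or saved with whatever file extension matches the audio format returned by the service.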
Available Models:
Important Notes for bulbul:v3:
The text(s) to be converted into speech.
Features:
Model-specific limits:
Important Note:
The language code in BCP-47 format.
The speaker voice to be used for the output audio.
Default: shubh (for bulbul:v3), anushka (for bulbul:v2)
Model Compatibility (Speakers compatible with respective model):
Note: Speaker selection must match the chosen model version.
Important: Speaker names are case-sensitive and must be lowercase (e.g., ritu not Ritu).
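One way to avoid case-related errors is to normalize the speaker name on the client before sending the request. The sketch below is an illustration only: `resolve_speaker` is a hypothetical helper, and it mirrors the documented defaults rather than checking speaker/model compatibility, since the compatibility list is not reproduced here.

```python
from typing import Optional

# Defaults as stated in this reference.
DEFAULT_SPEAKERS = {"bulbul:v3": "shubh", "bulbul:v2": "anushka"}


def resolve_speaker(model: str, speaker: Optional[str] = None) -> str:
    """Return a lowercase speaker name, falling back to the model's default."""
    if speaker is None:
        if model not in DEFAULT_SPEAKERS:
            raise ValueError(f"no documented default speaker for model {model!r}")
        return DEFAULT_SPEAKERS[model]
    # Speaker names must be lowercase, so "Ritu" becomes "ritu".
    return speaker.strip().lower()
```

For example, `resolve_speaker("bulbul:v3", "Ritu")` returns `"ritu"`, and `resolve_speaker("bulbul:v2")` returns `"anushka"`.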
Controls the pitch of the audio. Lower values produce a deeper voice, while higher values produce a sharper one. The suitable range is -0.75 to 0.75. Default is 0.0.
Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.
Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. Default is 1.0.
Model-specific ranges:
Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.
Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.
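The pitch and loudness constraints can be checked on the client before a request is built. The following sketch assumes the request fields are named `pitch` and `loudness` (the field names are not given in this reference) and treats the "suitable" ranges as hard limits, which is a stricter reading than the text requires.

```python
from typing import Dict, Optional

# Ranges as stated in this reference; treated as hard limits in this sketch.
PITCH_RANGE = (-0.75, 0.75)
LOUDNESS_RANGE = (0.3, 3.0)


def prosody_params(
    model: str,
    pitch: Optional[float] = None,
    loudness: Optional[float] = None,
) -> Dict[str, float]:
    """Return the pitch/loudness fields to send, or raise on invalid use."""
    params: Dict[str, float] = {}
    checks = (
        ("pitch", pitch, PITCH_RANGE),
        ("loudness", loudness, LOUDNESS_RANGE),
    )
    for name, value, (low, high) in checks:
        if value is None:
            continue  # Leave the field out so the service default applies.
        if model != "bulbul:v2":
            raise ValueError(f"{name} is only supported for bulbul:v2, got {model!r}")
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the range {low} to {high}")
        params[name] = value
    return params
```

Omitting unset fields, rather than sending the documented defaults explicitly, keeps the same helper usable for bulbul:v3 requests that set neither parameter.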
Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000, 32000, 44100, 48000 Hz.
Note: Higher sample rates (32000, 44100, 48000 Hz) are only available with bulbul:v3 via the REST API, not in streaming mode.
Default: 24000 Hz
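The sample-rate rules above can be expressed as a small client-side check. This is a sketch under stated assumptions: `check_sample_rate` and its `streaming` flag are hypothetical names used only to illustrate the documented constraints.

```python
SUPPORTED_RATES = (8000, 16000, 22050, 24000, 32000, 44100, 48000)
# Rates restricted to bulbul:v3 over the REST API (not streaming).
HIGH_RATES = frozenset({32000, 44100, 48000})
DEFAULT_RATE = 24000


def check_sample_rate(
    rate: int = DEFAULT_RATE,
    model: str = "bulbul:v3",
    streaming: bool = False,
) -> int:
    """Return rate if it is allowed for this model and mode, else raise."""
    if rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample rate: {rate} Hz")
    if rate in HIGH_RATES and (model != "bulbul:v3" or streaming):
        raise ValueError(
            f"{rate} Hz is only available with bulbul:v3 via the REST API"
        )
    return rate
```

For example, `check_sample_rate(48000, "bulbul:v3")` returns `48000`, while the same rate with `streaming=True` or with bulbul:v2 raises a `ValueError`.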
Controls whether English words and numeric entities (e.g., numbers, dates) are normalized. Set to true for better handling of mixed-language text.
Model-specific behavior:
Specifies the model to use for text-to-speech conversion.
Available models:
Controls how much randomness and expressiveness the TTS model uses while generating speech.
Lower values produce more stable and consistent output, while higher values sound more expressive but may introduce artifacts or errors. The suitable range is between 0.01 and 2.0. Default is 0.6.
Note: This parameter is only supported for bulbul:v3. It has no effect on bulbul:v2.
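A client can apply the temperature rules before sending a request, as in the sketch below. The field name `temperature` and the helper name are assumptions for illustration. The range is treated as a hard limit, and the field is omitted for models other than bulbul:v3 because it is not supported there.

```python
from typing import Dict

# Range and default as stated in this reference.
TEMPERATURE_RANGE = (0.01, 2.0)
DEFAULT_TEMPERATURE = 0.6


def temperature_params(
    model: str, temperature: float = DEFAULT_TEMPERATURE
) -> Dict[str, float]:
    """Return the temperature field for bulbul:v3, or an empty dict otherwise."""
    if model != "bulbul:v3":
        return {}  # Only supported for bulbul:v3, so it is not sent.
    low, high = TEMPERATURE_RANGE
    if not low <= temperature <= high:
        raise ValueError(f"temperature={temperature} is outside {low} to {high}")
    return {"temperature": temperature}
```

Lower values such as 0.3 favor stable, repeatable output, while values near the top of the range trade stability for expressiveness.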
The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text are replaced with their custom pronunciations before speech is generated.
Create and manage dictionaries via the Pronunciation Dictionary API. Only supported by bulbul:v3.
Enables caching for the request. When enabled, identical requests return cached audio instead of regenerating it. Default is false.
Note: This feature is currently in beta and is only available for the bulbul:v1 and bulbul:v2 models.