REST
Authentication
Request
The text(s) to be converted into speech.
Features:
- Supports code-mixed text (English and Indic languages)
Model-specific limits:
- bulbul:v3: Max 2500 characters
- bulbul:v2: Max 1500 characters
Important Note:
- For numbers with more than 4 digits, use commas (e.g., ‘10,000’ instead of ‘10000’) so that the number is pronounced as a whole number rather than digit by digit
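The limits and the number-formatting advice above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper names `add_thousands_separators` and `check_text_length` are hypothetical, the use of Western three-digit grouping is an assumption based on the ‘10,000’ example, and it is also an assumption that the service counts characters the same way Python's `len` does.

```python
import re

# Character limits taken from the model-specific limits listed above.
MAX_CHARS = {"bulbul:v3": 2500, "bulbul:v2": 1500}

# A run of five or more digits that does not start with 0 and is not the
# tail of a longer number, a decimal fraction, or an already grouped number.
_LONG_NUMBER = re.compile(r"(?<![\d.,])[1-9]\d{4,}")


def add_thousands_separators(text: str) -> str:
    """Rewrite digit runs such as 10000 as 10,000 (three-digit grouping)."""
    return _LONG_NUMBER.sub(lambda m: f"{int(m.group()):,}", text)


def check_text_length(text: str, model: str) -> None:
    """Raise ValueError if text exceeds the character limit for the model."""
    limit = MAX_CHARS[model]
    if len(text) > limit:
        raise ValueError(
            f"{model} accepts at most {limit} characters, got {len(text)}"
        )
```

Digit runs that start with 0 (for example, some identifiers) are deliberately left untouched; whether phone numbers or IDs should be rewritten at all depends on the application.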
The language code in BCP-47 format.
The speaker voice to be used for the output audio.
Default: shubh (for bulbul:v3), anushka (for bulbul:v2)
Model compatibility (speakers available for each model):
- bulbul:v3:
  - shubh (default), aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, ashutosh, advait, anand, tanya, tarun, sunny, mani, gokul, vijay, shruti, suhani, mohit, kavitha, rehan, soham, rupali
- bulbul:v2:
  - Female: anushka, manisha, vidya, arya
  - Male: abhilash, karun, hitesh
Note: Speaker selection must match the chosen model version.
Important: Speaker names are case-sensitive and must be lowercase (e.g., ritu not Ritu).
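Because the speaker must match the model and is case-sensitive, a client-side lookup can catch mistakes early. This is a sketch only: the function name `resolve_speaker` is hypothetical, and the speaker lists are copied from the compatibility list above, so they would need updating if the service adds or removes voices.

```python
from typing import Optional

# Speaker names copied from the compatibility list above.
SPEAKERS = {
    "bulbul:v3": frozenset(
        """shubh aditya ritu priya neha rahul pooja rohan simran kavya amit
        dev ishita shreya ratan varun manan sumit roopa kabir aayan ashutosh
        advait anand tanya tarun sunny mani gokul vijay shruti suhani mohit
        kavitha rehan soham rupali""".split()
    ),
    "bulbul:v2": frozenset(
        "anushka manisha vidya arya abhilash karun hitesh".split()
    ),
}
DEFAULT_SPEAKER = {"bulbul:v3": "shubh", "bulbul:v2": "anushka"}


def resolve_speaker(model: str, speaker: Optional[str] = None) -> str:
    """Return a speaker valid for the model, or raise ValueError."""
    if speaker is None:
        return DEFAULT_SPEAKER[model]
    allowed = SPEAKERS[model]
    if speaker not in allowed:
        # Point out the common mistake of capitalising the name.
        hint = " (speaker names must be lowercase)" if speaker.lower() in allowed else ""
        raise ValueError(f"{speaker!r} is not a {model} speaker{hint}")
    return speaker
```

For example, `resolve_speaker("bulbul:v3", "Ritu")` raises an error that mentions the lowercase requirement, while `resolve_speaker("bulbul:v2")` falls back to `anushka`.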
Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.
Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.
Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. Default is 1.0.
Model-specific ranges:
- bulbul:v3: 0.5 to 2.0
- bulbul:v2: 0.3 to 3.0
Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.
Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.
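The pitch, pace, and loudness rules above differ by model, which a small table-driven check can capture. This is a hedged sketch: the field names `pitch`, `pace`, and `loudness` and the helper `check_prosody` are assumptions for illustration, and treating the "suitable range" as a hard bound is a choice made here, not documented behaviour of the service.

```python
# Allowed ranges per model, taken from the parameter descriptions above.
# A parameter missing from a model's table is treated as unsupported.
RANGES = {
    "bulbul:v3": {"pace": (0.5, 2.0)},
    "bulbul:v2": {
        "pitch": (-0.75, 0.75),
        "pace": (0.3, 3.0),
        "loudness": (0.3, 3.0),
    },
}


def check_prosody(model: str, **params: float) -> dict:
    """Validate pitch/pace/loudness for the model and return them unchanged."""
    allowed = RANGES[model]
    for name, value in params.items():
        if name not in allowed:
            raise ValueError(f"{name} is not supported for {model}")
        low, high = allowed[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside {low} to {high} for {model}"
            )
    return params
```

With this table, `check_prosody("bulbul:v2", pace=2.5)` passes, while the same pace is rejected for bulbul:v3 and any `pitch` value is rejected for bulbul:v3.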
Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000, 32000, 44100, 48000 Hz.
Note: The higher sample rates (32000, 44100, 48000 Hz) are available only with bulbul:v3 through the REST API; they are not available in streaming mode.
Default: 24000 Hz
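The sample-rate rules can be expressed as a short check. This sketch assumes a hypothetical helper `resolve_sample_rate` with a `streaming` flag that the caller sets; neither is part of the API, and the values are copied from the description above.

```python
from typing import Optional

SUPPORTED_RATES = (8000, 16000, 22050, 24000, 32000, 44100, 48000)
HIGH_RATES = frozenset({32000, 44100, 48000})  # bulbul:v3 over REST only
DEFAULT_RATE = 24000


def resolve_sample_rate(
    model: str, rate: Optional[int] = None, streaming: bool = False
) -> int:
    """Return a sample rate in Hz that is valid for the model and mode."""
    if rate is None:
        return DEFAULT_RATE
    if rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample rate: {rate}")
    if rate in HIGH_RATES and (model != "bulbul:v3" or streaming):
        raise ValueError(
            f"{rate} Hz requires bulbul:v3 over the REST API (not streaming)"
        )
    return rate
```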
Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text.
Model-specific behavior:
- bulbul:v3: Not supported
- bulbul:v2: Default is false
Specifies the model to use for text-to-speech conversion.
Available models:
- bulbul:v3: Latest model with improved quality, 30+ voices, and pace and temperature controls
- bulbul:v2: Legacy model with pitch, loudness, and pace controls
Temperature controls how much randomness and expressiveness the TTS model uses while generating speech.
Lower values produce more stable and consistent output, while higher values sound more expressive but may introduce artifacts or errors. The suitable range is between 0.01 and 2.0. Default is 0.6.
Note: This parameter is only supported for bulbul:v3. It has no effect on bulbul:v2.
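As an illustration of the temperature rules, the sketch below returns the value to send for bulbul:v3 and `None` for other models, where the setting has no effect. The helper name `resolve_temperature` is hypothetical, and rejecting values outside the suitable range (rather than clamping them) is a choice made for this example.

```python
from typing import Optional

DEFAULT_TEMPERATURE = 0.6
TEMPERATURE_RANGE = (0.01, 2.0)


def resolve_temperature(
    model: str, temperature: Optional[float] = None
) -> Optional[float]:
    """Return the temperature to send, or None if the model ignores it."""
    if model != "bulbul:v3":
        return None  # no effect on bulbul:v2, so omit it from the request
    if temperature is None:
        return DEFAULT_TEMPERATURE
    low, high = TEMPERATURE_RANGE
    if not low <= temperature <= high:
        raise ValueError(f"temperature must be between {low} and {high}")
    return temperature
```

A lower value such as 0.3 would favour stable, repeatable output; a higher value such as 1.2 trades consistency for expressiveness.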
The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.
Create and manage dictionaries via the Pronunciation Dictionary API. Only supported by bulbul:v3.
Enable caching for the request. When enabled, identical requests will return cached audio instead of regenerating. Default is false.
Note: Currently in beta and only available for bulbul:v1 and bulbul:v2 models.
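Several of the options described above are tied to a specific model. One way to keep a request consistent is to drop the options the chosen model does not support before sending it. This is a sketch under stated assumptions: the field names (`pitch`, `loudness`, `enable_preprocessing`, `temperature`, `dict_id`, `enable_caching`) and the helper `filter_options` are hypothetical placeholders, not confirmed names from the API; only the model-to-feature mapping comes from the descriptions above.

```python
# Hypothetical field names; the support mapping follows the notes above.
V2_ONLY = {"pitch", "loudness", "enable_preprocessing"}
V3_ONLY = {"temperature", "dict_id"}
CACHE_MODELS = {"bulbul:v1", "bulbul:v2"}  # caching is in beta for these models


def filter_options(model: str, options: dict) -> dict:
    """Return a copy of options without fields the model does not support."""
    unsupported = set()
    if model != "bulbul:v2":
        unsupported |= V2_ONLY
    if model != "bulbul:v3":
        unsupported |= V3_ONLY
    if model not in CACHE_MODELS:
        unsupported.add("enable_caching")
    return {
        key: value
        for key, value in options.items()
        if key not in unsupported and value is not None
    }
```

Silently dropping fields keeps the example short; a stricter client might raise an error instead so that a caller notices an option is being ignored.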