Text to Speech

Converts text into spoken audio. The output is a WAV file encoded as a base64 string.

Headers

api-subscription-key string Required

Request

This endpoint expects an object.
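
As a rough illustration of the object this endpoint expects, the sketch below assembles the header and a JSON body from the fields documented in this section. The function name `build_tts_request`, the `Content-Type` header, the placeholder key, and the `hi-IN` language code are assumptions made for the example; nothing is sent, since the endpoint URL is not given here.

```python
import json


def build_tts_request(api_key, text, target_language_code, **options):
    """Return (headers, body) for a text-to-speech call; nothing is sent."""
    # The documented length constraint on `text` is 1 to 1500 characters.
    if not 1 <= len(text) <= 1500:
        raise ValueError("text must be between 1 and 1500 characters")
    headers = {
        "api-subscription-key": api_key,
        "Content-Type": "application/json",  # assumed: the body is JSON
    }
    payload = {"target_language_code": target_language_code, "text": text}
    # Optional fields (speaker, pitch, pace, ...) are included only when
    # given, so the documented defaults apply otherwise.
    payload.update({k: v for k, v in options.items() if v is not None})
    return headers, json.dumps(payload, ensure_ascii=False)


headers, body = build_tts_request(
    "YOUR_API_KEY",          # placeholder, not a real key
    "नमस्ते, how are you?",   # code-mixed text
    "hi-IN",                 # assumed to be one of the allowed codes
    speaker="Meera",
    model="bulbul:v1",
)
```

Leaving optional fields out of the body, rather than sending explicit defaults, keeps the request small and lets the service apply its own defaults.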

target_language_code enum Required

The language of the text, in BCP-47 format.

text string Optional >=1 character, <=1500 characters

The text(s) to be converted into speech.

Features:

  • Each text should be no longer than 1500 characters
  • Supports code-mixed text (English and Indic languages)
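
Longer inputs have to be split before they are sent. The following is a minimal sketch of one way to do that on the client side, assuming it is acceptable to break at whitespace and to collapse runs of whitespace; `split_text` is a hypothetical helper, not part of the API.

```python
def split_text(text, limit=1500):
    """Split text into chunks of at most `limit` characters.

    Breaks at whitespace where possible; whitespace runs are collapsed
    to single spaces. A single word longer than `limit` is hard-split.
    """
    chunks = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in one chunk on its own.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as its own `text` value.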

Important Note:

  • For numbers larger than 4 digits, use commas (e.g., ‘10,000’ instead of ‘10000’)
  • This ensures proper pronunciation as a whole number
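
The comma rule above can be applied automatically before sending text. The sketch below is one possible approach, with several assumptions: "larger than 4 digits" is read as five or more digits, grouping is in thousands (matching the '10,000' example), and decimals, already-grouped numbers, and digit runs with a leading zero are left untouched. `add_commas` is a hypothetical helper.

```python
import re

# A run of 5+ digits that is not part of a decimal or an already
# comma-grouped number.
_LONG_NUMBER = re.compile(r"(?<![\d.,])\d{5,}(?![.,]?\d)")


def add_commas(text):
    """Insert thousands separators into standalone numbers of 5+ digits."""
    def _group(match):
        digits = match.group(0)
        # Leading zeros suggest an ID or code, not a quantity; keep as-is.
        if digits.startswith("0"):
            return digits
        return f"{int(digits):,}"

    return _LONG_NUMBER.sub(_group, text)


print(add_commas("The ticket costs 10000 rupees."))
```

Digit strings that should be read digit by digit, such as phone numbers, would also be grouped by this helper, so it is best applied only where the numbers are known to be quantities.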

speaker enum Optional

The speaker voice to be used for the output audio.

Default: Meera

Model Compatibility (Speakers compatible with respective models):

  • bulbul:v1:

    • Female: Diya, Maya, Meera, Pavithra, Maitreyi, Misha
    • Male: Amol, Arjun, Amartya, Arvind, Neel, Vian
  • bulbul:v2:

    • Female: Anushka, Manisha, Vidya, Arya
    • Male: Abhilash, Karun, Hitesh

Note: Speaker selection must match the chosen model version.
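
A client can check this pairing before making a request. The sketch below encodes the compatibility lists above as a lookup table; the names `SPEAKERS_BY_MODEL` and `is_compatible` are hypothetical, and the case-insensitive comparison is an assumption, since this section does not say which casing the API expects.

```python
# Speaker lists copied from the compatibility table above (lowercased).
SPEAKERS_BY_MODEL = {
    "bulbul:v1": {
        "diya", "maya", "meera", "pavithra", "maitreyi", "misha",
        "amol", "arjun", "amartya", "arvind", "neel", "vian",
    },
    "bulbul:v2": {
        "anushka", "manisha", "vidya", "arya",
        "abhilash", "karun", "hitesh",
    },
}


def is_compatible(model, speaker):
    """Return True if `speaker` is listed for `model`."""
    return speaker.lower() in SPEAKERS_BY_MODEL.get(model, set())


print(is_compatible("bulbul:v1", "Meera"))  # True
print(is_compatible("bulbul:v2", "Meera"))  # False
```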

pitch double Optional

Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.

pace double Optional >=0.3, <=3

Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. The suitable range is between 0.5 and 2.0. Default is 1.0.

loudness double Optional >=0.1, <=3

Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.

speech_sample_rate integer Optional

Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000 Hz. If not provided, the default is 22050 Hz.
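
The numeric controls (pitch, pace, loudness, and sample rate) can be checked on the client before a request is built. This is a minimal sketch using the "suitable" ranges and supported rates quoted above; the helper names are hypothetical, and clamping rather than rejecting out-of-range values is a design choice made for the example, not documented API behaviour.

```python
# "Suitable" ranges as stated in the parameter descriptions.
SUITABLE_RANGES = {
    "pitch": (-0.75, 0.75),
    "pace": (0.5, 2.0),
    "loudness": (0.3, 3.0),
}
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000)


def clamp_to_suitable(name, value):
    """Clamp `value` into the suitable range for the named parameter."""
    low, high = SUITABLE_RANGES[name]
    return max(low, min(high, value))


def check_sample_rate(rate):
    """Return `rate` if supported, otherwise raise ValueError."""
    if rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {rate}")
    return rate


print(clamp_to_suitable("pitch", 1.0))  # 0.75
print(clamp_to_suitable("pace", 0.1))   # 0.5
```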

enable_preprocessing boolean Optional Defaults to false

Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text. Default is false.

model enum Optional

Specifies the model to use for text-to-speech conversion. Default is bulbul:v1.

Allowed values:

Response

Successful Response

audios list of strings

The output audio files in WAV format, encoded as base64 strings. Each string corresponds to one of the input texts.
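
Turning the `audios` strings back into playable audio is a matter of base64-decoding each entry. The sketch below shows this with the Python standard library, assuming the payload is PCM WAV that the `wave` module can read; the helper names are hypothetical, and a locally generated silent clip stands in for a real response.

```python
import base64
import io
import wave


def decode_audios(audios):
    """Decode each base64 string in `audios` into raw WAV bytes."""
    return [base64.b64decode(item) for item in audios]


def wav_duration_seconds(wav_bytes):
    """Return the duration of a PCM WAV clip held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def make_silent_wav(seconds=1, rate=22050):
    """Build a silent mono 16-bit WAV, used here as a stand-in response."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)
    return buffer.getvalue()


# Simulated response body; a real one would come from the service.
response = {"audios": [base64.b64encode(make_silent_wav()).decode("ascii")]}

for index, clip in enumerate(decode_audios(response["audios"])):
    print(index, wav_duration_seconds(clip))
```

To save a clip to disk, the decoded bytes can be written unchanged to a file opened in binary mode.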

request_id string Optional

Errors