POST /text-to-speech
curl --request POST \
  --url https://api.sarvam.ai/text-to-speech \
  --header 'Content-Type: application/json' \
  --header 'api-subscription-key: <api-subscription-key>' \
  --data '{
  "inputs": [
    "<string>"
  ],
  "target_language_code": "bn-IN",
  "speaker": "meera",
  "pitch": 0,
  "pace": 1.65,
  "loudness": 1.55,
  "speech_sample_rate": 8000,
  "enable_preprocessing": false,
  "model": "bulbul:v1"
}'
{
  "request_id": "<string>",
  "audios": [
    "<string>"
  ]
}

Headers

api-subscription-key
string

Your unique subscription key for authenticating requests to the Sarvam AI Text-to-Speech API. See the Sarvam AI documentation for the steps to get your API key.

Body

application/json
inputs
string[]
required

The text(s) to be converted into speech. Each text should be no longer than 500 characters. You can send up to 3 texts in a single API call. The text can be code-mixed, combining English and Indic languages.
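
The two limits above (500 characters per text, 3 texts per call) can be enforced on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `batch_inputs` is hypothetical, and it assumes the character limit is counted the way Python's `len()` counts, which the documentation does not confirm.

```python
from typing import Iterator, List

MAX_CHARS_PER_TEXT = 500  # per-text limit stated in the field description
MAX_TEXTS_PER_CALL = 3    # per-request limit stated in the field description


def batch_inputs(texts: List[str]) -> Iterator[List[str]]:
    """Yield lists of at most MAX_TEXTS_PER_CALL texts for the `inputs` field."""
    # Reject over-long texts up front (assumption: len() matches the API's count).
    for index, text in enumerate(texts):
        if len(text) > MAX_CHARS_PER_TEXT:
            raise ValueError(
                f"text {index} has {len(text)} characters; "
                f"the limit is {MAX_CHARS_PER_TEXT}"
            )
    # Slice the list into consecutive groups, one group per API call.
    for start in range(0, len(texts), MAX_TEXTS_PER_CALL):
        yield texts[start:start + MAX_TEXTS_PER_CALL]
```

Each yielded list can be used directly as the `inputs` value of one request.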

target_language_code
enum<string>
required

The language of the text, in BCP-47 format.

Available options:
bn-IN,
en-IN,
gu-IN,
hi-IN,
kn-IN,
ml-IN,
mr-IN,
od-IN,
pa-IN,
ta-IN,
te-IN
speaker
enum<string> | null
default:meera

The speaker to be used for the output audio. If not provided, Meera will be used as default.

Available options:
meera,
pavithra,
maitreyi,
arvind,
amol,
amartya,
diya,
neel,
misha,
vian,
arjun,
maya,
anushka,
abhilash,
manisha,
vidya,
arya,
karun,
hitesh
pitch
number | null
default:0

Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.

Required range: -1 <= x <= 1
pace
number | null
default:1

Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. The suitable range is between 0.5 and 2.0. Default is 1.0.

Required range: 0.3 <= x <= 3
loudness
number | null
default:1

Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.

Required range: 0.1 <= x <= 3
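
The required ranges for pitch, pace, and loudness can be checked locally before calling the endpoint. This is a hedged sketch: the function name `check_prosody` and the choice to raise `ValueError` are illustrative assumptions, and only the numeric bounds come from the field descriptions above.

```python
# Required ranges copied from the pitch, pace, and loudness field descriptions.
REQUIRED_RANGES = {
    "pitch": (-1.0, 1.0),
    "pace": (0.3, 3.0),
    "loudness": (0.1, 3.0),
}


def check_prosody(pitch=0.0, pace=1.0, loudness=1.0):
    """Return the three values as a dict, or raise if one is out of range."""
    values = {"pitch": pitch, "pace": pace, "loudness": loudness}
    for name, value in values.items():
        if value is None:
            # The fields are nullable; None leaves the choice to the server.
            continue
        low, high = REQUIRED_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return values
```

The narrower "suitable" ranges are recommendations rather than hard limits, so this sketch does not enforce them.
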
speech_sample_rate
enum<integer> | null
default:22050

Specifies the sample rate of the output audio. Supported values are 8000, 16000, and 22050 Hz. If not provided, the default is 22050 Hz.

Available options:
8000,
16000,
22050
enable_preprocessing
boolean
default:false

Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text. Default is false.

model
enum<string>

Specifies the model to use for text-to-speech conversion. Default is bulbul:v1.

Available options:
bulbul:v1,
bulbul:v2
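
For readers calling the endpoint from Python rather than curl, the header and body fields described above can be assembled with the standard library. This is a sketch under stated assumptions: `build_tts_request` is a hypothetical helper, the URL is taken from the curl example, and the request is only built here, not sent.

```python
import json
import urllib.request

TTS_URL = "https://api.sarvam.ai/text-to-speech"  # endpoint from the curl example


def build_tts_request(api_key, inputs, target_language_code, **options):
    """Return an unsent urllib Request; `options` holds optional body fields."""
    body = {
        "inputs": list(inputs),
        "target_language_code": target_language_code,
    }
    # Leave out options set to None so the documented defaults apply.
    body.update({key: value for key, value in options.items() if value is not None})
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "api-subscription-key": api_key,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; error handling and timeouts are left out of this sketch.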

Response

200
application/json
Successful Response
request_id
string | null
required
audios
string[]
required

The output audio files in WAV format, encoded as base64 strings. Each string corresponds to one of the input texts.
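
Because each entry in `audios` is a base64 string, it has to be decoded before it can be played. The sketch below writes one WAV file per entry, in input order. The helper name `save_audios` and the file naming scheme are assumptions for illustration; only the `audios` field and its base64 WAV encoding come from the response description.

```python
import base64
from pathlib import Path
from typing import List


def save_audios(response_body: dict, out_dir: str, stem: str = "tts") -> List[Path]:
    """Decode every base64 string in `audios` and write it as a .wav file."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(response_body["audios"]):
        # The index matches the position of the corresponding input text.
        path = target / f"{stem}_{index}.wav"
        path.write_bytes(base64.b64decode(encoded))
        paths.append(path)
    return paths
```

Here `response_body` is the parsed JSON of a 200 response, for example the result of `json.loads` on the response text.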