REST | Sarvam API Docs

Convert text into spoken audio. The output is a base64-encoded audio string that must be decoded before use.

Available Models:

- bulbul:v3: Latest model with improved quality, 30+ voices, and temperature control
- bulbul:v2: Legacy model with pitch and loudness controls

Important Notes for bulbul:v3:

- Pitch and loudness parameters are NOT supported
- Pace range: 0.5 to 2.0
- Preprocessing is automatically enabled
- Default sample rate is 24000 Hz
- Supports sample rates: 8000, 16000, 22050, 24000 Hz (REST API also supports 32000, 44100, 48000 Hz)

Authentication

api-subscription-key (string)

API Key authentication via header
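As a minimal sketch of header-based authentication, the helper below builds the options object for a `fetch` call. Only the `api-subscription-key` header name comes from this page. The function name, the interface names, and the JSON `Content-Type` are assumptions made for illustration, and the endpoint URL is deliberately left to the caller because this section does not state it.

```typescript
// Hypothetical helper: not part of any Sarvam SDK.
interface TtsRequestBody {
  text: string;
  target_language_code: string;
  model?: string;
  speaker?: string;
}

interface TtsFetchOptions {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildTtsFetchOptions(apiKey: string, body: TtsRequestBody): TtsFetchOptions {
  if (apiKey.trim() === "") {
    throw new Error("api-subscription-key must not be empty");
  }
  return {
    method: "POST",
    headers: {
      // Header name taken from the Authentication section above.
      "api-subscription-key": apiKey,
      // Assumption: the request object is sent as JSON.
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  };
}
```

The result can be passed as the second argument to `fetch`, with the endpoint URL taken from the API reference.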

Request

This endpoint expects an object.

text (string, required)

The text(s) to be converted into speech.

Features:

- Supports code-mixed text (English and Indic languages)

Model-specific limits:

- bulbul:v3: Max 2500 characters
- bulbul:v2: Max 1500 characters

Important Note:

- For numbers with more than four digits, use commas (e.g., ‘10,000’ instead of ‘10000’) so that they are pronounced as whole numbers.
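The length limits and the comma rule above could be applied before sending a request. The sketch below is one way to do that, under stated assumptions: the function names are hypothetical, commas are inserted in groups of three (matching the ‘10,000’ example), and "characters" is approximated with JavaScript string length, which may not match how the API counts. Note that the separator step also rewrites long digit runs that are not quantities, such as phone numbers, so it may need to be applied selectively.

```typescript
type BulbulModel = "bulbul:v3" | "bulbul:v2";

// Limits taken from the "Model-specific limits" list above.
const MAX_CHARS: Record<BulbulModel, number> = {
  "bulbul:v3": 2500,
  "bulbul:v2": 1500,
};

// Adds thousands separators to digit runs of five or more digits.
// Runs that directly follow "." or "," are left alone (decimals, already-grouped numbers).
function addThousandsSeparators(text: string): string {
  return text.replace(/\d{5,}/g, (digits: string, offset: number) => {
    const previous = offset > 0 ? text[offset - 1] : "";
    if (previous === "." || previous === ",") {
      return digits;
    }
    return digits.replace(/\B(?=(\d{3})+(?!\d))/g, ",");
  });
}

// Normalizes numbers, then rejects text that exceeds the model's limit.
function prepareText(text: string, model: BulbulModel): string {
  const prepared = addThousandsSeparators(text);
  const limit = MAX_CHARS[model];
  if (prepared.length > limit) {
    throw new RangeError(`${model} accepts at most ${limit} characters, got ${prepared.length}`);
  }
  return prepared;
}
```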

target_language_code (enum, required)

The language code in BCP-47 format.

speaker (enum or null, optional)

The speaker voice to be used for the output audio.

Default: Shubh (for bulbul:v3), Anushka (for bulbul:v2)

Model Compatibility (Speakers compatible with respective model):

bulbul:v3:
- Shubh (default), Aditya, Ritu, Priya, Neha, Rahul, Pooja, Rohan, Simran, Kavya, Amit, Dev, Ishita, Shreya, Ratan, Varun, Manan, Sumit, Roopa, Kabir, Aayan, Ashutosh, Advait, Amelia, Sophia, Anand, Tanya, Tarun, Sunny, Mani, Gokul, Vijay, Shruti, Suhani, Mohit, Kavitha, Rehan, Soham, Rupali
bulbul:v2:
- Female: Anushka, Manisha, Vidya, Arya
- Male: Abhilash, Karun, Hitesh

Note: Speaker selection must match the chosen model version.
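Because a speaker is only valid for one model version, a client might check the pairing before calling the API. The following sketch encodes the lists above as a lookup table. The function and constant names are hypothetical, and matching is case-insensitive because this section does not state the exact casing the API expects.

```typescript
type BulbulModel = "bulbul:v3" | "bulbul:v2";

// Speaker lists copied from the Model Compatibility section above.
const SPEAKERS: Record<BulbulModel, readonly string[]> = {
  "bulbul:v3": [
    "Shubh", "Aditya", "Ritu", "Priya", "Neha", "Rahul", "Pooja", "Rohan",
    "Simran", "Kavya", "Amit", "Dev", "Ishita", "Shreya", "Ratan", "Varun",
    "Manan", "Sumit", "Roopa", "Kabir", "Aayan", "Ashutosh", "Advait",
    "Amelia", "Sophia", "Anand", "Tanya", "Tarun", "Sunny", "Mani", "Gokul",
    "Vijay", "Shruti", "Suhani", "Mohit", "Kavitha", "Rehan", "Soham", "Rupali",
  ],
  "bulbul:v2": ["Anushka", "Manisha", "Vidya", "Arya", "Abhilash", "Karun", "Hitesh"],
};

const DEFAULT_SPEAKER: Record<BulbulModel, string> = {
  "bulbul:v3": "Shubh",
  "bulbul:v2": "Anushka",
};

// Returns the listed name for a valid speaker, the model default when none
// is given, and throws when the speaker belongs to a different model.
function resolveSpeaker(model: BulbulModel, speaker?: string | null): string {
  if (speaker == null) {
    return DEFAULT_SPEAKER[model];
  }
  const wanted = speaker.toLowerCase();
  const match = SPEAKERS[model].find((name) => name.toLowerCase() === wanted);
  if (match === undefined) {
    throw new Error(`Speaker "${speaker}" is not available for ${model}`);
  }
  return match;
}
```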

pitch (double or null, optional, defaults to 0)

Controls the pitch of the audio. Lower values result in a deeper voice, while higher values make it sharper. The suitable range is between -0.75 and 0.75. Default is 0.0.

Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.

pace (double or null, optional, range 0.3-3, defaults to 1)

Controls the speed of the audio. Lower values result in slower speech, while higher values make it faster. Default is 1.0.

Model-specific ranges:

- bulbul:v3: 0.5 to 2.0
- bulbul:v2: 0.3 to 3.0
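Since the accepted pace range depends on the model, a caller may want to normalize the value first. The sketch below clamps a requested pace into the range listed above. The helper name is hypothetical, and clamping is a design choice made for this example. Rejecting out-of-range values with an error would be equally reasonable.

```typescript
type BulbulModel = "bulbul:v3" | "bulbul:v2";

// Ranges taken from the "Model-specific ranges" list above.
const PACE_RANGE: Record<BulbulModel, { min: number; max: number }> = {
  "bulbul:v3": { min: 0.5, max: 2.0 },
  "bulbul:v2": { min: 0.3, max: 3.0 },
};

// Clamps pace into the model's range; the documented default is 1.0.
function clampPace(model: BulbulModel, pace: number = 1.0): number {
  if (!Number.isFinite(pace)) {
    throw new TypeError("pace must be a finite number");
  }
  const { min, max } = PACE_RANGE[model];
  return Math.min(max, Math.max(min, pace));
}
```

For example, a pace of 3 is kept as 3 for bulbul:v2 but reduced to 2 for bulbul:v3.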

loudness (double or null, optional, range 0.1-3, defaults to 1)

Controls the loudness of the audio. Lower values result in quieter audio, while higher values make it louder. The suitable range is between 0.3 and 3.0. Default is 1.0.

Note: This parameter is only supported for bulbul:v2. It is NOT supported for bulbul:v3.

speech_sample_rate (enum or null, optional)

Specifies the sample rate of the output audio. Supported values are 8000, 16000, 22050, 24000, 32000, 44100, 48000 Hz.

Note: Higher sample rates (32000, 44100, 48000 Hz) are only available with bulbul:v3 via the REST API, not in streaming mode.

Default: 24000 Hz
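The sample-rate rules above can be read as a small decision table: four rates are generally available, and three higher rates require bulbul:v3 over the REST API. The sketch below encodes that reading. The function name and the `Transport` type are hypothetical, and treating the four lower rates as valid for bulbul:v2 is an inference from the notes rather than an explicit statement.

```typescript
type BulbulModel = "bulbul:v3" | "bulbul:v2";
type Transport = "rest" | "streaming";

const BASE_RATES: ReadonlySet<number> = new Set([8000, 16000, 22050, 24000]);
// Higher rates: bulbul:v3 via the REST API only, per the note above.
const HIGH_RATES: ReadonlySet<number> = new Set([32000, 44100, 48000]);

function isSampleRateSupported(
  rate: number,
  model: BulbulModel,
  transport: Transport,
): boolean {
  if (BASE_RATES.has(rate)) {
    return true;
  }
  return HIGH_RATES.has(rate) && model === "bulbul:v3" && transport === "rest";
}
```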

enable_preprocessing (boolean, optional, defaults to false)

Controls whether normalization of English words and numeric entities (e.g., numbers, dates) is performed. Set to true for better handling of mixed-language text.

Model-specific behavior:

- bulbul:v3: Not supported as a parameter; preprocessing is always enabled
- bulbul:v2: Default is false

model (enum, optional)

Specifies the model to use for text-to-speech conversion.

Available models:

- bulbul:v3: Latest model with improved quality, 30+ voices, pace, and temperature control
- bulbul:v2: Legacy model with pitch, loudness, and pace controls

output_audio_codec (enum or null, optional)

Specifies the audio codec for the output audio file. Different codecs offer various compression and quality characteristics.

temperature (double or null, optional, range 0.01-2, defaults to 0.6)

Temperature controls how much randomness and expressiveness the TTS model uses while generating speech.

Lower values produce more stable and consistent output, while higher values sound more expressive but may introduce artifacts or errors. The suitable range is between 0.01 and 2.0. Default is 0.6.

Note: This parameter is only supported for bulbul:v3. It has no effect on bulbul:v2.
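Several controls described so far apply to only one model: pitch, loudness, and enable_preprocessing to bulbul:v2, and temperature to bulbul:v3. One way to keep a shared settings object safe for either model is to drop the fields the chosen model does not use, as sketched below. The interface and function names are hypothetical, and silently dropping fields is an assumption made for this example. A stricter client could throw instead.

```typescript
type BulbulModel = "bulbul:v3" | "bulbul:v2";

interface VoiceControls {
  pitch?: number | null;
  loudness?: number | null;
  temperature?: number | null;
  enable_preprocessing?: boolean;
}

// Returns a copy of the controls with model-incompatible fields removed.
function controlsForModel(model: BulbulModel, controls: VoiceControls): VoiceControls {
  const result: VoiceControls = { ...controls };
  if (model === "bulbul:v3") {
    // Not supported for bulbul:v3 according to the parameter notes.
    delete result.pitch;
    delete result.loudness;
    delete result.enable_preprocessing;
  } else {
    // Has no effect on bulbul:v2 according to the note above.
    delete result.temperature;
  }
  return result;
}
```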

dict_id (string or null, optional)

The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.

Create and manage dictionaries via the Pronunciation Dictionary API (https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create). Only supported by bulbul:v3.

enable_cached_responses (boolean, optional, defaults to false)

Enable caching for the request. When enabled, identical requests will return cached audio instead of regenerating. Default is false.

Note: Currently in beta and only available for bulbul:v1 and bulbul:v2 models.

Response

Successful Response

request_id (string or null)

audios (list of strings)

The output audio files in WAV format, encoded as base64 strings. Each string corresponds to one of the input texts.
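Since each entry in audios is base64 text rather than raw audio, it has to be decoded before it can be played or saved. The sketch below shows one way to do that in Node.js. The `TtsResponse` interface mirrors the two response fields above, while the function names and the output file naming are hypothetical. The `.wav` extension assumes the WAV output described above.

```typescript
import { Buffer } from "node:buffer";
import { writeFileSync } from "node:fs";

interface TtsResponse {
  request_id: string | null;
  audios: string[];
}

// Decodes every base64 string in the response into raw bytes.
function decodeAudios(response: TtsResponse): Buffer[] {
  return response.audios.map((encoded) => Buffer.from(encoded, "base64"));
}

// Sanity check: a WAV file starts with "RIFF" and has "WAVE" at bytes 8-11.
function looksLikeWav(bytes: Buffer): boolean {
  return (
    bytes.length >= 12 &&
    bytes.toString("ascii", 0, 4) === "RIFF" &&
    bytes.toString("ascii", 8, 12) === "WAVE"
  );
}

// Writes each decoded audio to "<prefix>-<index>.wav" and returns the paths.
function saveAudios(response: TtsResponse, prefix: string = "output"): string[] {
  return decodeAudios(response).map((bytes, index) => {
    const path = `${prefix}-${index}.wav`;
    writeFileSync(path, bytes);
    return path;
  });
}
```

`saveAudios` can be called on the value returned by the API, provided it has the shape of `TtsResponse`.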

import { SarvamAIClient } from "sarvamai";

async function main() {
  const client = new SarvamAIClient({
    apiSubscriptionKey: "YOUR_API_KEY_HERE",
  });
  await client.textToSpeech.convert({
    text: "Hello, welcome to Sarvam's text-to-speech service. This demo converts your text into natural spoken audio.",
    target_language_code: "en-IN",
  });
}
main();

{
  "request_id": "a1b2c3d4-e5f6-7890-ab12-cd34ef567890",
  "audios": [
    "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAIA+AAACABAAZGF0YQAAAAA="
  ]
}
