REST Stream

POST

https://api.sarvam.ai/text-to-speech/stream

Converts the input text into a streamed spoken audio response. This endpoint supports streaming audio using the specified output codec (e.g., `audio/mpeg` for MP3). The response is returned as a binary audio stream, which can be played or saved directly by the client. Supports the `dict_id` parameter to apply a [pronunciation dictionary](https://docs.sarvam.ai/api-reference-docs/pronunciation-dictionary/create) during synthesis.

Authentication

api-subscription-key (string)

API Key authentication via header

Request

This endpoint expects an object.

text (string, required, 1-3500 characters)

The text to be converted into streamed speech.

Features:

Max 3500 characters
Supports code-mixed text (English and Indic languages)

Important Note:

For numbers with more than 4 digits, use commas (e.g., ‘10,000’ instead of ‘10000’). This ensures the number is pronounced as a whole number.
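
The comma rule can be applied to the input text before a request is sent. The sketch below is one possible approach and is not part of the API: the helper name `add_thousands_separators` is hypothetical, it uses three-digit grouping to match the ‘10,000’ example, and it leaves alone digit runs that start with 0 or follow a decimal point. It would also reformat long IDs or phone numbers, so treat it as a starting point only.

```python
import re

# Runs of 5 or more digits that are not preceded by a digit or a decimal point.
_LONG_INT = re.compile(r"(?<![\d.])\d{5,}")


def add_thousands_separators(text: str) -> str:
    """Insert commas into long integers, e.g. '10000' -> '10,000'."""

    def _format(match: "re.Match[str]") -> str:
        digits = match.group()
        # Leading zeros suggest a code or ID rather than a quantity; keep as-is.
        if digits.startswith("0"):
            return digits
        return f"{int(digits):,}"

    return _LONG_INT.sub(_format, text)


print(add_thousands_separators("The prize is 10000 rupees."))
```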

target_language_code (enum, optional)

The language code in BCP-47 format.

speaker (enum or null, optional)

The speaker voice to be used for the output audio.

Default: shubh (for bulbul:v3), anushka (for bulbul:v2)

Note: Speaker selection must match the chosen model version.

Important: Speaker names are case-sensitive and must be lowercase (e.g., ritu not Ritu).
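
A client can normalize the speaker name before sending it. The following is a minimal sketch under stated assumptions: `resolve_speaker` and `DEFAULT_SPEAKERS` are hypothetical names, and the defaults are copied from the description above. It does not check that a speaker belongs to the chosen model, because the full speaker lists are not given here.

```python
from typing import Optional

# Per-model defaults as stated in the speaker description.
DEFAULT_SPEAKERS = {"bulbul:v3": "shubh", "bulbul:v2": "anushka"}


def resolve_speaker(model: str, speaker: Optional[str] = None) -> str:
    """Return a lowercase speaker name, falling back to the model's default."""
    if speaker is None:
        # Raises KeyError for a model that is not listed above.
        return DEFAULT_SPEAKERS[model]
    # Speaker names are case-sensitive and must be lowercase.
    return speaker.strip().lower()


print(resolve_speaker("bulbul:v2", "Ritu"))
```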

pitch (double or null, optional, -1 to 1)

Controls the pitch of the audio. Range: -0.75 to 0.75. Default is 0.0.

Note: Only supported for bulbul:v2.

pace (double or null, optional, 0.3 to 3, defaults to 1)

Controls the speed of the audio. Default is 1.0.

Model-specific ranges:

bulbul:v3: 0.5 to 2.0
bulbul:v2: 0.3 to 3.0
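
Because the valid pace range depends on the model, a client may want to bring the value into range before sending it. This is a sketch, not documented API behavior: `clamp_pace` and `PACE_RANGES` are hypothetical names, and clamping (rather than raising an error) is just one design choice. How the service itself responds to an out-of-range pace is not stated here.

```python
# Model-specific pace ranges as listed above.
PACE_RANGES = {"bulbul:v3": (0.5, 2.0), "bulbul:v2": (0.3, 3.0)}


def clamp_pace(model: str, pace: float = 1.0) -> float:
    """Clamp pace into the range listed for the given model."""
    low, high = PACE_RANGES[model]  # KeyError for a model not listed above
    return max(low, min(high, pace))


print(clamp_pace("bulbul:v3", 2.5))
```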

loudness (double or null, optional, 0.1 to 3)

Controls the loudness of the audio. Range: 0.3 to 3.0. Default is 1.0.

Note: Only supported for bulbul:v2.

speech_sample_rate (enum or null, optional, defaults to 22050)

Specifies the sample rate of the output audio. Default is 22050 Hz.

Note: OPUS codec only supports 8000, 12000, 16000, 24000, 48000 Hz.
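
One consequence of the note above is that the default of 22050 Hz is not in the OPUS list, so a request that selects OPUS would need an explicit sample rate. The check below is a sketch: `check_sample_rate` is a hypothetical helper, and the codec string `"opus"` is an assumption, since the exact enum values for `output_audio_codec` are not listed here.

```python
# Sample rates the note above lists for the OPUS codec.
OPUS_SAMPLE_RATES = {8000, 12000, 16000, 24000, 48000}
DEFAULT_SAMPLE_RATE = 22050


def check_sample_rate(codec: str, sample_rate: int = DEFAULT_SAMPLE_RATE) -> None:
    """Raise ValueError if the sample rate is not usable with OPUS."""
    # "opus" as the codec value is assumed, not confirmed by this page.
    if codec.lower() == "opus" and sample_rate not in OPUS_SAMPLE_RATES:
        allowed = ", ".join(str(r) for r in sorted(OPUS_SAMPLE_RATES))
        raise ValueError(
            f"OPUS does not support {sample_rate} Hz; use one of: {allowed}"
        )


check_sample_rate("mp3")          # passes: the restriction applies to OPUS only
check_sample_rate("opus", 24000)  # passes: 24000 Hz is in the OPUS list
```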

enable_preprocessing (boolean, optional, defaults to false)

Controls whether normalization of English words and numeric entities is performed. Default is false.

model (enum, optional)

Specifies the model to use for text-to-speech conversion. Default is bulbul:v2.

Allowed values:

temperature (double or null, optional, 0.01 to 1, defaults to 0.6)

Controls the randomness of the output. Range: 0.01 to 1.0. Default is 0.6.

Note: Only supported for bulbul:v3.

enable_cached_responses (boolean, optional, defaults to false)

Enable caching for the request. Default is false. Currently in beta.

dict_id (string or null, optional)

The ID of a pronunciation dictionary to apply during synthesis. When provided, matching words in the input text will be replaced with their custom pronunciations before generating speech.

Create and manage dictionaries via the Pronunciation Dictionary API. Only supported by bulbul:v3.
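
Several parameters are tied to one model version: `pitch` and `loudness` are noted as bulbul:v2 only, while `temperature` and `dict_id` are noted as bulbul:v3 only. A request body can be assembled with those notes in mind. The sketch below is an illustration only: `build_payload` is a hypothetical helper, silently dropping unsupported options is a design choice (raising an error is equally valid), and the length check counts Python code points, which may differ from how the service counts characters.

```python
from typing import Any, Dict

# Model-specific parameters, taken from the "Only supported for" notes.
V2_ONLY = {"pitch", "loudness"}
V3_ONLY = {"temperature", "dict_id"}


def build_payload(text: str, model: str, **options: Any) -> Dict[str, Any]:
    """Build a request body, omitting options the chosen model does not support."""
    if not 1 <= len(text) <= 3500:
        raise ValueError("text must be 1-3500 characters")

    if model == "bulbul:v2":
        unsupported = V3_ONLY
    elif model == "bulbul:v3":
        unsupported = V2_ONLY
    else:
        unsupported = set()  # unknown model: pass every option through

    payload: Dict[str, Any] = {"text": text, "model": model}
    for name, value in options.items():
        # Skip unset options and options the model does not accept.
        if value is not None and name not in unsupported:
            payload[name] = value
    return payload


print(build_payload("Hello", "bulbul:v3", pitch=0.2, temperature=0.6))
```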

output_audio_codec (enum, optional)

Specifies the codec for the streamed output audio (e.g., ‘mp3’).

output_audio_bitrate (enum, optional)

Bitrate for the streamed output audio. Default is '128k'.

Allowed values:

Response

Success. Returns a streamed audio response in the requested format (e.g., audio/mpeg for MP3, audio/wav for WAV).
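
Since the response body is a binary audio stream, a client can write it to disk chunk by chunk instead of buffering it all in memory. The following is a minimal sketch using only the Python standard library. The helper names `make_request` and `stream_to_file`, the chunk size, and the output path are all assumptions for illustration; the URL and the `api-subscription-key` header come from this page.

```python
import json
import urllib.request

STREAM_URL = "https://api.sarvam.ai/text-to-speech/stream"


def make_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request with the API key header and a JSON body."""
    return urllib.request.Request(
        STREAM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "api-subscription-key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_to_file(api_key: str, payload: dict, path: str,
                   chunk_size: int = 4096) -> int:
    """Write the streamed audio to `path`; return the number of bytes written."""
    written = 0
    request = make_request(api_key, payload)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes means the stream has ended
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Calling `stream_to_file("<apiSubscriptionKey>", {"text": "Hello", "output_audio_codec": "mp3"}, "speech.mp3")` would perform a real network request, so it is left out of the block above.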

Errors

400 Bad Request Error

403 Forbidden Error

422 Unprocessable Entity Error

429 Too Many Requests Error

500 Internal Server Error
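
This page does not say which of these errors are worth retrying. Going by general HTTP conventions, one reasonable client policy is to retry 429 and 500 with exponential backoff and to treat 400, 403, and 422 as problems with the request itself. The sketch below encodes that assumption; `should_retry` and `backoff_delay` are hypothetical helpers, and the delay values are arbitrary.

```python
# Assumed policy, based on general HTTP semantics rather than this API's docs.
RETRYABLE_STATUSES = {429, 500}


def should_retry(status: int) -> bool:
    """Return True for rate-limit and server errors, False for client errors."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff in seconds: base * 2**attempt, limited to `cap`."""
    return min(cap, base * (2 ** attempt))


for status in (400, 403, 422, 429, 500):
    print(status, "retry" if should_retry(status) else "fix the request")
```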

curl -X POST https://api.sarvam.ai/text-to-speech/stream \
  -H "api-subscription-key: <apiSubscriptionKey>" \
  -H "Content-Type: application/json" \
  --output output.mp3 \
  -d '{
    "text": "Hello, welcome to Sarvam AI'\''s text-to-speech service!",
    "target_language_code": "en-IN",
    "speaker": "shubh",
    "pitch": 0,
    "pace": 1.2,
    "loudness": 1,
    "speech_sample_rate": 22050,
    "enable_preprocessing": false,
    "model": "bulbul:v3",
    "temperature": 0.6,
    "enable_cached_responses": false,
    "dict_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "output_audio_codec": "mp3",
    "output_audio_bitrate": "128k"
  }'
