REST | Sarvam API Docs

Speech to Text API

This API transcribes speech to text in multiple Indian languages and English. It supports transcription for interactive applications.

Available Options:

- **REST API** (current endpoint): For short audio (under 30 seconds), with results returned immediately.
- **Batch API**: For longer audio files; see the [Batch API documentation](https://docs.sarvam.ai/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api).
  - Supports diarization (speaker identification)

Note:

- Pricing differs for the REST and Batch APIs.
- Diarization is only available in the Batch API, with separate pricing.
- See the [pricing page](https://docs.sarvam.ai/api-reference-docs/pricing) for detailed pricing information.
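The choice between the two options can be sketched as a small client-side rule. This is only an illustration: `choose_api` and `REST_MAX_SECONDS` are hypothetical names, and the 30-second cutoff is read from the description above rather than from any enforced limit.

```python
REST_MAX_SECONDS = 30.0  # assumption: cutoff taken from the "under 30 seconds" guidance


def choose_api(duration_seconds: float, needs_diarization: bool = False) -> str:
    """Return "rest" or "batch" for a clip (hypothetical helper)."""
    if needs_diarization:
        # Diarization is only available in the Batch API.
        return "batch"
    if duration_seconds < REST_MAX_SECONDS:
        return "rest"
    return "batch"
```

For instance, a 12-second clip without diarization maps to the REST endpoint, while any clip that needs speaker labels maps to the Batch API regardless of length.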

Authentication

api-subscription-key (string)

API Key authentication via header

Request

This endpoint expects a multipart form containing a file.
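A minimal sketch of assembling such a multipart request with only the Python standard library follows. The endpoint URL and the `api-subscription-key` header come from the curl example on this page; the helper names `build_request` and `transcribe` are hypothetical, and the hand-rolled multipart encoding is one possible approach, not an official client.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

API_URL = "https://api.sarvam.ai/speech-to-text"


def build_request(path, api_key, fields=None):
    """Build (but do not send) a multipart/form-data POST for one audio file."""
    boundary = uuid.uuid4().hex
    filename = os.path.basename(path)
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"

    parts = []
    # Plain text form fields such as model, mode, or language_code.
    for name, value in (fields or {}).items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))

    # The audio itself goes in the form field named "file".
    with open(path, "rb") as handle:
        audio = handle.read()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "api-subscription-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(path, api_key, fields=None):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, api_key, fields)) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the encoding easy to inspect without network access. A call such as `transcribe("interview_sample.wav", api_key)` would then perform the actual upload.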

file (file, required)

The audio file to transcribe. Supported formats include WAV, MP3, AAC, AIFF, OGG, OPUS, FLAC, MP4/M4A, AMR, WMA, WebM, and PCM formats. The API automatically detects most codec formats, but for PCM files (pcm_s16le, pcm_l16, pcm_raw), you must specify the input_audio_codec parameter. PCM files are supported only at 16kHz sample rate. The API works best with audio files sampled at 16kHz. If the audio contains multiple channels, they will be merged into a single channel.
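Before uploading a WAV file, it can help to check its sample rate and channel count against the guidance above. The sketch below uses Python's standard `wave` module, which reads only uncompressed PCM WAV files; `inspect_wav` and the keys of the returned dictionary are hypothetical names chosen for illustration.

```python
import wave


def inspect_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(sample_rate)
    return {
        "sample_rate_hz": sample_rate,
        "channels": channels,
        "duration_seconds": duration,
        # The API is described as working best with 16 kHz audio.
        "is_16khz": sample_rate == 16000,
        # Multi-channel audio is merged into a single channel by the API.
        "will_be_downmixed": channels > 1,
    }
```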

model (enum, optional)

Specifies the model to use for speech-to-text conversion.

- **saaras:v3** (default, recommended): State-of-the-art model with flexible output formats. Supports multiple modes via the `mode` parameter: transcribe, translate, verbatim, translit, codemix.
- **saarika:v2.5** (legacy): Transcribes audio in the spoken language. Kept for backward compatibility.

mode (enum or null, optional, defaults to transcribe)

Mode of operation. **Only applicable when using the saaras:v3 model.**

Example audio: 'मेरा फोन नंबर है 9840950950'

- **transcribe** (default): Standard transcription in the original language with proper formatting and number normalization.
  - Output: `मेरा फोन नंबर है 9840950950`
- **translate**: Translates speech from any supported Indic language to English.
  - Output: `My phone number is 9840950950`
- **verbatim**: Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is.
  - Output: `मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero`
- **translit**: Romanization. Transliterates speech to Latin/Roman script only.
  - Output: `mera phone number hai 9840950950`
- **codemix**: Code-mixed text with English words in English and Indic words in native script.
  - Output: `मेरा phone number है 9840950950`
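Because `mode` applies only to saaras:v3, a client may want to catch a mismatched combination before sending the request. The sketch below is one way to do that; `build_model_fields` is a hypothetical helper, and rejecting the combination on the client side is a design choice made here, not documented API behavior.

```python
from typing import Dict, Optional

# Modes listed above for saaras:v3.
SAARAS_V3_MODES = {"transcribe", "translate", "verbatim", "translit", "codemix"}


def build_model_fields(model: str = "saaras:v3", mode: Optional[str] = None) -> Dict[str, str]:
    """Return the model/mode form fields, validating the combination first."""
    fields = {"model": model}
    if mode is None:
        # Omit the field and let the API apply its default mode.
        return fields
    if model != "saaras:v3":
        raise ValueError("mode is only applicable when using saaras:v3")
    if mode not in SAARAS_V3_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    fields["mode"] = mode
    return fields
```

For example, `build_model_fields(mode="translate")` yields the fields for an Indic-to-English request, while passing a mode together with `saarika:v2.5` raises `ValueError`.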

language_code (enum, optional)

Specifies the language of the input audio in BCP-47 format.

**Note:** This parameter is optional for the `saarika:v2.5` model.

**Available Options:**

- `unknown`: Use when the language is not known; the API will auto-detect.
- `hi-IN`: Hindi
- `bn-IN`: Bengali
- `kn-IN`: Kannada
- `ml-IN`: Malayalam
- `mr-IN`: Marathi
- `od-IN`: Odia
- `pa-IN`: Punjabi
- `ta-IN`: Tamil
- `te-IN`: Telugu
- `en-IN`: English
- `gu-IN`: Gujarati

**Additional Options (saaras:v3 only):**

- `as-IN`: Assamese
- `ur-IN`: Urdu
- `ne-IN`: Nepali
- `kok-IN`: Konkani
- `ks-IN`: Kashmiri
- `sd-IN`: Sindhi
- `sa-IN`: Sanskrit
- `sat-IN`: Santali
- `mni-IN`: Manipuri
- `brx-IN`: Bodo
- `mai-IN`: Maithili
- `doi-IN`: Dogri
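The split between the common language codes and the saaras:v3-only codes can be expressed as a small lookup. This is a sketch under the assumption that the two lists above are complete; `is_supported_language` is a hypothetical helper, not part of any SDK.

```python
# Codes available to both models, per the list above.
BASE_LANGUAGES = {
    "unknown", "hi-IN", "bn-IN", "kn-IN", "ml-IN", "mr-IN",
    "od-IN", "pa-IN", "ta-IN", "te-IN", "en-IN", "gu-IN",
}

# Codes listed as additional options for saaras:v3 only.
SAARAS_V3_ONLY = {
    "as-IN", "ur-IN", "ne-IN", "kok-IN", "ks-IN", "sd-IN",
    "sa-IN", "sat-IN", "mni-IN", "brx-IN", "mai-IN", "doi-IN",
}


def is_supported_language(language_code: str, model: str = "saaras:v3") -> bool:
    """Check a language code against the lists documented for the given model."""
    if language_code in BASE_LANGUAGES:
        return True
    return model == "saaras:v3" and language_code in SAARAS_V3_ONLY
```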

input_audio_codec (enum, optional)

Audio codec/format of the input file. PCM files are supported only at a 16kHz sample rate.
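Since the codec must be given explicitly for raw PCM input, a client can guard against forgetting it. The following sketch uses the PCM codec names mentioned in the `file` description (pcm_s16le, pcm_l16, pcm_raw); `codec_fields` and its `is_raw_pcm` flag are hypothetical, and how a caller decides that a file is raw PCM is left open.

```python
from typing import Dict, Optional

# PCM codec names mentioned in the description of the file parameter.
PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}


def codec_fields(is_raw_pcm: bool, input_audio_codec: Optional[str] = None) -> Dict[str, str]:
    """Return the input_audio_codec form field, requiring it for raw PCM."""
    if not is_raw_pcm:
        # Other formats rely on automatic detection unless the caller overrides it.
        return {"input_audio_codec": input_audio_codec} if input_audio_codec else {}
    if input_audio_codec not in PCM_CODECS:
        raise ValueError("raw PCM input needs one of: " + ", ".join(sorted(PCM_CODECS)))
    return {"input_audio_codec": input_audio_codec}
```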

Response

Successful Response

request_id (string or null)

transcript (string)

The transcribed text from the provided audio file.

language_code (string or null)

The BCP-47 code of the language spoken in the input. If multiple languages are detected, this is the code of the most predominant spoken language. If no language is detected, this is null.

timestamps (object or null)

Contains timestamps for the transcribed text. This field is included only if with_timestamps is set to true.
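In the sample response on this page, the timestamps object holds three parallel arrays: `words`, `start_time_seconds`, and `end_time_seconds`. Assuming that layout, the arrays can be zipped into per-word spans; `word_spans` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple


def word_spans(timestamps: Optional[Dict[str, list]]) -> List[Tuple[str, float, float]]:
    """Pair each word with its start and end time in seconds."""
    if not timestamps:
        # The field is null or absent when timestamps were not requested.
        return []
    return list(
        zip(
            timestamps["words"],
            timestamps["start_time_seconds"],
            timestamps["end_time_seconds"],
        )
    )
```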

diarized_transcript (object or null)

Diarized transcript of the provided speech.

language_probability (double or null)

Float value (0.0 to 1.0) indicating the probability that the detected language is correct. Higher values indicate higher confidence.

Returns a value when:

- language_code is not provided in the request
- language_code is set to unknown

Returns null when:

- A specific language_code is provided (language detection is skipped)

The field is always present in the response.
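The rules above suggest a simple way to interpret the field on the client. In this sketch, `detection_status`, its return labels, and the 0.8 threshold are all assumptions made for illustration; the API does not define a confidence cutoff.

```python
def detection_status(response: dict, min_confidence: float = 0.8) -> str:
    """Classify a response by how its language was determined."""
    probability = response.get("language_probability")
    if probability is None:
        # A specific language_code was sent, so detection was skipped.
        return "skipped"
    if response.get("language_code") is None:
        # Detection ran but no language was identified.
        return "undetected"
    return "confident" if probability >= min_confidence else "uncertain"
```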

Errors

- 400: Bad Request Error
- 403: Forbidden Error
- 422: Unprocessable Entity Error
- 429: Too Many Requests Error
- 500: Internal Server Error
- 503: Service Unavailable Error
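A client might treat some of these statuses as transient and retry with exponential backoff. The sketch below assumes that 429 and 503 are worth retrying and that the others are not; that split, the delay values, and the name `retry_delay` are assumptions, not documented API guidance.

```python
from typing import Optional

RETRYABLE_STATUSES = {429, 503}  # assumption: treated as transient


def retry_delay(
    status_code: int,
    attempt: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 30.0,
) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based), or None."""
    if status_code not in RETRYABLE_STATUSES:
        # Client errors such as 400, 403, and 422 will not succeed on retry.
        return None
    # Double the wait on each attempt, up to the cap.
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

Adding random jitter to the returned delay is a common refinement when many clients may retry at once.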

```shell
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <apiSubscriptionKey>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@interview_sample.wav
```

```json
{
  "request_id": "20240615_1a2b3c4d-5678-90ab-cdef-1234567890ab",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "language_code": "hi-IN",
  "timestamps": {
    "words": ["नमस्ते", "आप", "कैसे", "हैं?"],
    "start_time_seconds": [0, 0.5, 1, 1.5],
    "end_time_seconds": [0.5, 1, 1.5, 2],
    "timestamps": {
      "end_time_seconds": [2],
      "start_time_seconds": [0],
      "words": ["नमस्ते, आप कैसे हैं?"]
    }
  },
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "नमस्ते, आप कैसे हैं?",
        "start_time_seconds": 0,
        "end_time_seconds": 2,
        "speaker_id": "speaker_1"
      }
    ]
  },
  "language_probability": 0.95
}
```