REST | Sarvam API Docs

Speech to Text Translation API

This API automatically detects the input language, transcribes the speech, and translates the text to English.

Available Options:

- REST API (current endpoint): for short audio (under 30 seconds), with results returned immediately
- Batch API: for longer audio files; see the Batch API documentation (https://docs.sarvam.ai/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api)
  - Supports diarization (speaker identification)

Note:

- Pricing differs between the REST and Batch APIs
- Diarization is available only in the Batch API and is priced separately
- See the pricing page (https://docs.sarvam.ai/api-reference-docs/getting-started/pricing) for detailed pricing information

Authentication

api-subscription-key (string)

API Key authentication via header
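
As a minimal sketch of header-based authentication, the snippet below prepares a POST request that carries the key in the api-subscription-key header, without sending it. The endpoint URL and the helper name build_request are assumptions for illustration; this section does not state the URL, so substitute the one from the official reference.

```python
import urllib.request

# Assumption: the section above does not give the endpoint URL.
# Replace this value with the URL from the official reference.
ASSUMED_ENDPOINT = "https://api.sarvam.ai/speech-to-text-translate"


def build_request(api_key: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST that carries the API key in a header."""
    return urllib.request.Request(
        ASSUMED_ENDPOINT,
        data=body,
        method="POST",
        headers={
            # Header name taken from the Authentication section.
            "api-subscription-key": api_key,
            # For this endpoint the body is a multipart form,
            # e.g. "multipart/form-data; boundary=...".
            "Content-Type": content_type,
        },
    )
```

HTTP header names are case-insensitive, so it does not matter that urllib normalizes the capitalization of the name internally.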

Request

This endpoint expects a multipart form containing a file.

file (file, required)

The audio file to transcribe. Supported formats include WAV, MP3, AAC, AIFF, OGG, OPUS, FLAC, MP4/M4A, AMR, WMA, WebM, and PCM. The API automatically detects most codec formats, but for PCM files (pcm_s16le, pcm_l16, pcm_raw) you must specify the input_audio_codec parameter. PCM files are supported only at a 16 kHz sample rate. The API works best with 16 kHz audio. Multi-channel audio is merged.
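
Because the API works best at 16 kHz and merges multiple channels, it can help to inspect a file before uploading it. The sketch below uses Python's standard wave module to read the sample rate and channel count of WAV data. The helper names describe_wav and is_preferred_rate are hypothetical, and the check only covers WAV input, not the other supported formats.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return the sample rate and channel count of in-memory WAV data."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "channels": wav.getnchannels(),
        }


def is_preferred_rate(data: bytes, preferred_hz: int = 16000) -> bool:
    """True when the WAV data uses the 16 kHz rate the API works best with."""
    return describe_wav(data)["sample_rate_hz"] == preferred_hz
```

A file that fails this check is not necessarily rejected by the API (only PCM input is documented as 16 kHz only), so treat the result as a hint to resample rather than as validation.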

prompt (string or null, optional)

Conversation context can be passed as a prompt to improve model accuracy. However, prompt support is still experimental and does not match the prompt performance of large language models.

model (enum, optional)

Model used to convert speech to text in the target language.

Allowed values:

input_audio_codec (enum, optional)

Audio codec/format of the input file. The API detects the codec automatically for most formats, but for PCM files (pcm_s16le, pcm_l16, pcm_raw) you must pass this parameter. PCM files are supported only at a 16 kHz sample rate.
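
The PCM rule above can be enforced on the client before a request is made. The following is a sketch under stated assumptions: build_form_fields and its is_pcm flag are hypothetical helpers, not part of any Sarvam SDK, and the function only assembles the optional text fields of the multipart form (the file part is handled separately).

```python
from typing import Optional

# PCM codec names listed in the parameter description.
PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}


def build_form_fields(
    is_pcm: bool,
    input_audio_codec: Optional[str] = None,
    prompt: Optional[str] = None,
    model: Optional[str] = None,
) -> dict:
    """Collect optional form fields, requiring a codec for PCM input."""
    if is_pcm and input_audio_codec not in PCM_CODECS:
        raise ValueError(
            "PCM input requires input_audio_codec to be one of "
            + ", ".join(sorted(PCM_CODECS))
        )
    fields = {
        "input_audio_codec": input_audio_codec,
        "prompt": prompt,
        "model": model,
    }
    # Leave out unset optional fields instead of sending empty values.
    return {name: value for name, value in fields.items() if value is not None}
```

For example, build_form_fields(is_pcm=True, input_audio_codec="pcm_s16le") returns only the codec field, while calling it with is_pcm=True and no codec raises ValueError.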

Response

Successful Response

request_id (string or null)

transcript (string)

Transcript of the provided speech.

language_code (enum or null)

The BCP-47 code of the language spoken in the input. If multiple languages are detected, this is the code of the predominant spoken language. If no language is detected, this is null.

diarized_transcript (object or null)

Diarized transcript of the provided speech.

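from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_API_SUBSCRIPTION_KEY",
)

# The endpoint requires an audio file; "audio.wav" is a placeholder path.
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.translate(file=audio_file)

print(response.transcript)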

{
  "request_id": "string",
  "transcript": "string",
  "language_code": "hi-IN",
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "string",
        "start_time_seconds": 1.1,
        "end_time_seconds": 1.1,
        "speaker_id": "string"
      }
    ]
  }
}
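
Several response fields can be null, so client code should not assume they are present. The sketch below parses a response body shaped like the sample into small dataclasses and treats a null diarized_transcript as an empty list of entries. The class and function names are illustrative and are not part of the API or SDK.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiarizedEntry:
    transcript: str
    start_time_seconds: float
    end_time_seconds: float
    speaker_id: str


@dataclass
class TranslationResult:
    transcript: str
    request_id: Optional[str] = None
    language_code: Optional[str] = None  # null when no language is detected
    entries: List[DiarizedEntry] = field(default_factory=list)


def parse_response(raw: str) -> TranslationResult:
    """Parse a JSON response body, tolerating the nullable fields."""
    payload = json.loads(raw)
    # diarized_transcript may be null; fall back to an empty object.
    diarized = payload.get("diarized_transcript") or {}
    entries = [
        DiarizedEntry(
            transcript=item["transcript"],
            start_time_seconds=item["start_time_seconds"],
            end_time_seconds=item["end_time_seconds"],
            speaker_id=item["speaker_id"],
        )
        for item in diarized.get("entries", [])
    ]
    return TranslationResult(
        transcript=payload["transcript"],
        request_id=payload.get("request_id"),
        language_code=payload.get("language_code"),
        entries=entries,
    )
```

Only transcript is read as a required key here, matching its non-nullable string type in the field list.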
