POST /speech-to-text-translate
curl --request POST \
  --url https://api.sarvam.ai/speech-to-text-translate \
  --header 'api-subscription-key: <api-subscription-key>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'file=@<path-to-audio-file>' \
  --form 'prompt=<string>' \
  --form model=saaras:v2 \
  --form with_diarization=false \
  --form num_speakers=123
{
  "request_id": "<string>",
  "transcript": "<string>",
  "language_code": "hi-IN",
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "<string>",
        "start_time_seconds": 123,
        "end_time_seconds": 123,
        "speaker_id": "<string>"
      }
    ]
  }
}

Headers

api-subscription-key
string

Your unique subscription key, used to authenticate requests to the Sarvam AI Speech-to-Text API. See the Sarvam AI documentation for the steps to obtain an API key.

Body

multipart/form-data
file
file
required

The audio file to transcribe. Supported formats are WAV (.wav) and MP3 (.mp3). The model works best with audio sampled at 16 kHz. Audio with multiple channels is merged.
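As a rough illustration of the constraints above, the sketch below checks a file locally before upload. The helper name `check_audio` and its warning strings are hypothetical and not part of the API. It only inspects WAV headers (via the standard-library `wave` module); MP3 files are accepted on their extension alone.

```python
import wave
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3"}  # formats listed in the file description
PREFERRED_SAMPLE_RATE_HZ = 16_000        # the documented "works best" rate


def check_audio(path):
    """Return a list of warnings for an audio file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {suffix or '(none)'}")

    warnings = []
    if suffix == ".wav":
        # The wave module reads the header of uncompressed PCM WAV files.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
        if rate != PREFERRED_SAMPLE_RATE_HZ:
            warnings.append(f"sample rate is {rate} Hz, not 16000 Hz")
        if channels > 1:
            warnings.append(f"{channels} channels will be merged by the API")
    return warnings
```

An empty list means the file matches the recommended settings. A warning does not mean the API will reject the file.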

prompt
string | null

Conversation context can be passed as a prompt to improve model accuracy. This feature is experimental, and its prompt handling does not match that of large language models.

model
enum<string>

The model used to convert speech to text in the target language.

Available options: saaras:v1, saaras:v2, saaras:turbo, saaras:flash

with_diarization
boolean
default: false

Enables speaker diarization, which identifies and separates the different speakers in the audio. When set to true, the response includes speaker-specific segments. Note: This parameter is currently in beta.

num_speakers
integer | null

Number of speakers to detect in the audio. Used only when with_diarization is set to true.
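The header and body parameters above can be assembled into a request roughly as follows. This is a minimal sketch using only the Python standard library. The function name `build_request`, the hand-rolled multipart encoding, and the per-file `Content-Type` values are assumptions for illustration, not something this reference specifies. The request is built but not sent.

```python
import uuid
import urllib.request

API_URL = "https://api.sarvam.ai/speech-to-text-translate"


def build_request(api_key, filename, audio_bytes, model="saaras:v2",
                  prompt=None, with_diarization=False, num_speakers=None):
    """Build (but do not send) a multipart/form-data request (hypothetical helper)."""
    fields = {"model": model, "with_diarization": str(with_diarization).lower()}
    if prompt is not None:
        fields["prompt"] = prompt
    # num_speakers is only used when diarization is enabled, so skip it otherwise.
    if with_diarization and num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)

    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\nContent-Disposition: form-data; '
             f'name="{name}"\r\n\r\n{value}\r\n').encode("utf-8")
        )
    # Assumed MIME types for the two supported formats.
    content_type = "audio/mpeg" if filename.lower().endswith(".mp3") else "audio/wav"
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n').encode("utf-8")
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "api-subscription-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` and decoding the body with `json.load` would complete the call. A third-party HTTP client that encodes multipart forms itself would make the manual boundary handling unnecessary.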

Response

200
application/json
Successful Response
request_id
string | null
required
transcript
string
required

Transcript of the provided speech

language_code
enum<string> | null
required

The BCP-47 code of the language spoken in the input. If multiple languages are detected, this is the code of the predominant spoken language. If no language is detected, this is null.

Available options: hi-IN, bn-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, gu-IN, en-IN
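Because language_code may be null, client code needs to handle the missing case explicitly. The sketch below is one way to do that. The display names in `LANGUAGE_NAMES` and the helper `describe_language` are illustrative additions and are not returned by the API.

```python
# Display names are an assumption for illustration; the API returns only the codes.
LANGUAGE_NAMES = {
    "hi-IN": "Hindi", "bn-IN": "Bengali", "kn-IN": "Kannada", "ml-IN": "Malayalam",
    "mr-IN": "Marathi", "od-IN": "Odia", "pa-IN": "Punjabi", "ta-IN": "Tamil",
    "te-IN": "Telugu", "gu-IN": "Gujarati", "en-IN": "English",
}


def describe_language(response):
    """Turn the language_code of a parsed response into readable text (hypothetical helper)."""
    code = response.get("language_code")
    if code is None:
        # The field is null when no language was detected.
        return "no language detected"
    return LANGUAGE_NAMES.get(code, f"unrecognized code {code}")
```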
diarized_transcript
object | null

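Diarized transcript of the provided speech. It contains a list of entries, each with a transcript, start_time_seconds, end_time_seconds, and speaker_id.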
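A response shaped like the sample at the top of this page could be regrouped per speaker along these lines. The helper `turns_by_speaker` is hypothetical, and the sample values are made up for the example.

```python
import json
from collections import defaultdict


def turns_by_speaker(response):
    """Group diarized entries by speaker_id (hypothetical helper)."""
    diarized = response.get("diarized_transcript")
    if not diarized:
        # The field is null or absent when diarization was not requested.
        return {}
    grouped = defaultdict(list)
    for entry in diarized.get("entries") or []:
        grouped[entry["speaker_id"]].append(
            (entry["start_time_seconds"], entry["end_time_seconds"], entry["transcript"])
        )
    return dict(grouped)


# Made-up payload that follows the documented response shape.
sample = json.loads("""
{
  "request_id": "example",
  "transcript": "Hello. Hi there.",
  "language_code": "hi-IN",
  "diarized_transcript": {
    "entries": [
      {"transcript": "Hello.", "start_time_seconds": 0.0,
       "end_time_seconds": 1.2, "speaker_id": "0"},
      {"transcript": "Hi there.", "start_time_seconds": 1.5,
       "end_time_seconds": 2.4, "speaker_id": "1"}
    ]
  }
}
""")
speakers = turns_by_speaker(sample)
```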