Speech To Text Translate
Real-Time Speech to Text Translation API
This API automatically detects the input language, transcribes the speech, and translates the text to English.
Available Options:
- Real-Time API (Current Endpoint): For quick responses under 30 seconds with immediate results
- Batch API: For longer audio files; requires following a notebook script (View Notebook). Supports diarization (speaker identification).
Note:
- Pricing differs for Real-Time and Batch APIs
- Diarization is only available in Batch API with separate pricing
- Please refer to dashboard.sarvam.ai for detailed pricing information
Headers
Your unique subscription key for authenticating requests to the Sarvam AI Speech-to-Text API. See the steps to get your API key.
Body
The audio file to transcribe. Supported formats are WAV (.wav) and MP3 (.mp3). Works best at a 16 kHz sample rate. Multiple channels will be merged into one.
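Before uploading, it can help to check a WAV file against the guidance above (16 kHz sample rate, channels merged on the server). The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `make_silent_wav` are hypothetical and are not part of the API.

```python
import io
import wave

# From the note above: the API works best with 16 kHz audio.
PREFERRED_SAMPLE_RATE_HZ = 16_000


def describe_wav(source):
    """Return basic properties of a WAV file.

    `source` may be a path string or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
        "is_preferred_rate": rate == PREFERRED_SAMPLE_RATE_HZ,
        # More than one channel means the audio will be merged.
        "will_be_merged": channels > 1,
    }


def make_silent_wav(seconds=1.0, rate=16_000, channels=1):
    """Build an in-memory 16-bit silent WAV, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(describe_wav(make_silent_wav(seconds=2.0, channels=2)))
```

This only inspects WAV input; MP3 files would need a third-party decoder, which is not shown here.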
Conversation context can be passed as a prompt to improve model accuracy. This feature is still experimental, and its prompt handling does not match that of large language models.
Model to be used for converting speech to text in the target language.
Allowed values: saaras:v1, saaras:v2, saaras:turbo, saaras:flash
Enables speaker diarization, which identifies and separates different speakers in the audio. When set to true, the API will provide speaker-specific segments in the response. Note: This parameter is currently in Beta mode.
Number of speakers to be detected in the audio. Used only when with_diarization is set to true.
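The body parameters above might be assembled into a multipart request roughly as follows. This is a sketch under stated assumptions: this section does not name the endpoint URL, the header name, or any form field other than with_diarization, so `API_URL`, `API_KEY_HEADER`, and the field names `file`, `model`, `prompt`, and `num_speakers` are placeholders to be checked against the official reference. The third-party `requests` package is assumed for the upload itself.

```python
# Allowed model values, as listed above.
ALLOWED_MODELS = ("saaras:v1", "saaras:v2", "saaras:turbo", "saaras:flash")

# ASSUMPTIONS: the URL and header name are not given in this section.
API_URL = "https://api.sarvam.ai/speech-to-text-translate"
API_KEY_HEADER = "api-subscription-key"


def build_form_fields(model, prompt=None, with_diarization=False,
                      num_speakers=None):
    """Build the non-file form fields for the request.

    Field names other than `with_diarization` are assumed.
    """
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}, got {model!r}")
    fields = {"model": model}
    if prompt:
        fields["prompt"] = prompt
    if with_diarization:
        fields["with_diarization"] = "true"
        # The speaker count only applies when diarization is enabled.
        if num_speakers is not None:
            fields["num_speakers"] = str(num_speakers)
    return fields


def translate_speech(audio_path, api_key, **options):
    """Upload an audio file and return the decoded JSON response."""
    import requests  # third-party; imported lazily so the helpers above stay stdlib-only

    with open(audio_path, "rb") as audio:
        response = requests.post(
            API_URL,
            headers={API_KEY_HEADER: api_key},
            data=build_form_fields(**options),
            files={"file": audio},  # assumed field name for the audio file
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

A call would look like `translate_speech("call.wav", api_key, model="saaras:v2")`. Validating the model name locally avoids spending a request on a value the API would reject.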
Response
Transcript of the provided speech
The BCP-47 code of the language spoken in the input. If multiple languages are detected, this is the code of the most predominant spoken language. If no language is detected, this is null.
Possible values: hi-IN, bn-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, gu-IN, en-IN
Diarized transcript of the provided speech
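A client should handle the null language code and the optional diarized transcript described above. The sketch below works on an already-decoded JSON object; the key names `transcript`, `language_code`, and `diarized_transcript` are assumptions, since this section describes the response fields without naming them, and the sample payload is invented for illustration.

```python
def summarize_response(payload):
    """Extract the main response fields from a decoded JSON object.

    Key names are assumed; adjust them to match the actual response.
    """
    language = payload.get("language_code")
    return {
        "transcript": payload.get("transcript", ""),
        "language": language,
        # The language code is null when no language could be detected.
        "language_detected": language is not None,
        # A diarized transcript is only expected when diarization was requested.
        "has_diarization": payload.get("diarized_transcript") is not None,
    }


if __name__ == "__main__":
    # Illustrative payload only, not real API output.
    sample = {
        "transcript": "Hello, how are you?",
        "language_code": "hi-IN",
        "diarized_transcript": None,
    }
    print(summarize_response(sample))
```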