Real-Time Speech to Text API
This API transcribes speech to text in multiple Indian languages and English. Supports real-time transcription for interactive applications.
Available Options:
- Real-Time API (current endpoint): for short audio clips (under 30 seconds) with immediate results
- Batch API: for longer audio files; requires following a notebook script - View Notebook
- Diarization (speaker identification) is supported (Batch API only; see note below)
Note:
- Pricing differs for Real-Time and Batch APIs
- Diarization is only available in Batch API with separate pricing
- Please refer to dashboard.sarvam.ai for detailed pricing information
Headers
Your unique subscription key for authenticating requests to the Sarvam AI Speech-to-Text API. See dashboard.sarvam.ai for the steps to get your API key.
Body
The audio file to transcribe. Supported formats are WAV (.wav) and MP3 (.mp3).
The API works best with audio files sampled at 16kHz. If the audio contains multiple channels, they will be merged into a single channel.
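Since the API works best with 16 kHz single-channel audio, it can help to check a file locally before uploading. The sketch below uses only Python's standard-library wave module; the function name is illustrative, not part of the API.

```python
import io
import wave

def check_wav(path_or_file):
    """Report a WAV file's sample rate and channel count.

    The API performs best at 16 kHz and merges multi-channel audio
    to mono, so checking locally avoids surprises.
    """
    with wave.open(path_or_file, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "is_16khz_mono": rate == 16000 and channels == 1,
    }
```

If the file is not 16 kHz mono, a tool such as ffmpeg (e.g. `ffmpeg -i in.mp3 -ar 16000 -ac 1 out.wav`) can resample it before upload.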
Specifies the model to use for speech-to-text conversion.
Note: the default model is saarika:v2.
Allowed values: saarika:v1, saarika:v2, saarika:flash
Specifies the language of the input audio. This parameter is required to ensure accurate transcription.
For the saarika:v1 model, this parameter is mandatory.
For the saarika:v2 model, it is optional.
unknown: use this when the language is not known; the API will detect it automatically.
Note: the saarika:v1 model does not support the unknown language code.
Allowed values: unknown, hi-IN, bn-IN, kn-IN, ml-IN, mr-IN, od-IN, pa-IN, ta-IN, te-IN, en-IN, gu-IN
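The model/language rules above can be enforced client-side before sending a request. This is a minimal sketch; the helper name and the error messages are illustrative, and the language list mirrors the allowed values above.

```python
# Language codes accepted by the API, per the allowed values above.
SUPPORTED_LANGUAGE_CODES = {
    "unknown", "hi-IN", "bn-IN", "kn-IN", "ml-IN", "mr-IN",
    "od-IN", "pa-IN", "ta-IN", "te-IN", "en-IN", "gu-IN",
}

def validate_language_code(model, language_code=None):
    """Apply the documented rules: saarika:v1 requires an explicit
    language and rejects 'unknown'; saarika:v2 may omit it and
    auto-detect the language."""
    if model == "saarika:v1":
        if language_code is None or language_code == "unknown":
            raise ValueError("saarika:v1 requires an explicit language_code")
    if language_code is not None and language_code not in SUPPORTED_LANGUAGE_CODES:
        raise ValueError("unsupported language_code: %s" % language_code)
    # Treat a missing code as auto-detect for models that support it.
    return language_code or "unknown"
```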
Enables timestamps in the response. If set to true, the response will include timestamps in the transcript.
Enables speaker diarization, which identifies and separates different speakers in the audio. When set to true, the API will provide speaker-specific segments in the response. Note: This parameter is currently in Beta mode.
Number of speakers to be detected in the audio. This is used when with_diarization is set to true.
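Putting the body parameters together, a request is a multipart form upload. The sketch below assumes the endpoint URL, the api-subscription-key header name, and lowercase-string booleans in the form fields; verify all three against the official API reference before relying on them.

```python
API_URL = "https://api.sarvam.ai/speech-to-text"  # assumed endpoint path

def build_form_fields(model="saarika:v2", language_code="unknown",
                      with_timestamps=False, with_diarization=False,
                      num_speakers=None):
    """Assemble the multipart form fields described above.

    Booleans are sent as lowercase strings, a common multipart-form
    convention (an assumption worth verifying).
    """
    fields = {
        "model": model,
        "language_code": language_code,
        "with_timestamps": str(with_timestamps).lower(),
        "with_diarization": str(with_diarization).lower(),
    }
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)
    return fields

def transcribe(api_key, audio_path, **kwargs):
    # Deferred import so the helper above works without requests installed.
    import requests
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"api-subscription-key": api_key},  # assumed header name
            data=build_form_fields(**kwargs),
            files={"file": (audio_path, f, "audio/wav")},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()
```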
Response
The transcribed text from the provided audio file.
"नमस्ते, आप कैसे हैं?"
Contains timestamps for the transcribed text. This field is included only if with_timestamps is set to true.
{
  "timestamps": {
    "end_time_seconds": [16.27],
    "start_time_seconds": [0],
    "words": [
      "Good afternoon, this is Naveen from Sarvam."
    ]
  }
}
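Because the response stores segments and their times in parallel arrays, client code typically zips them back together. A minimal sketch, using the response shape shown above (the helper name is illustrative):

```python
def align_words(timestamps):
    """Pair each transcript segment with its (start, end) times in
    seconds, matching the parallel-array layout of the timestamps
    field shown above."""
    return list(zip(
        timestamps["words"],
        timestamps["start_time_seconds"],
        timestamps["end_time_seconds"],
    ))
```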
Diarized transcript of the provided speech.
This will return the BCP-47 code of the language spoken in the input. If multiple languages are detected, the code of the most predominant spoken language is returned. If no language is detected, this will be null.