Batch Speech-to-Text API
Process long audio files (up to 1 hour) using synchronous or asynchronous methods. Ideal for meetings, interviews, call center recordings, and large-scale content processing pipelines.
Note: You can upload up to 20 audio files per job.
Model Availability: The Batch API supports Saaras v3 (recommended), which offers multiple output modes via the mode parameter. The legacy models Saarika v2.5 and Saaras v2.5 are also available, but we recommend switching to Saaras v3 for the best accuracy and features.
When you call init job (POST /speech-to-text/job/v1), set input_audio_codec inside job_parameters when needed. Behavior matches the REST upload APIs: most formats are auto-detected; for raw PCM (pcm_s16le, pcm_l16, pcm_raw) you must set this field. PCM must be 16 kHz sample rate. See supported formats and MIME types for the full list.
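The codec rule above can be sketched as a small helper that assembles job_parameters and refuses raw PCM input without an explicit codec. This is a minimal sketch, not the SDK's own API: the helper name build_job_parameters, the is_raw_pcm flag, and the example values "saaras:v3" and "transcribe" are assumptions made for illustration. Only the input_audio_codec field, the mode parameter, and the three PCM codec names come from the text.

```python
from typing import Optional

# Raw PCM codecs named in the text; these are not auto-detected.
RAW_PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}


def build_job_parameters(model: str, mode: str,
                         input_audio_codec: Optional[str] = None,
                         is_raw_pcm: bool = False) -> dict:
    """Assemble a job_parameters object (hypothetical helper, not part of the SDK)."""
    if is_raw_pcm and input_audio_codec not in RAW_PCM_CODECS:
        raise ValueError(
            "raw PCM input requires input_audio_codec to be one of "
            + ", ".join(sorted(RAW_PCM_CODECS))
        )
    params = {"model": model, "mode": mode}
    # Only send the codec when it was given; other formats are auto-detected.
    if input_audio_codec is not None:
        params["input_audio_codec"] = input_audio_codec
    return params


# "saaras:v3" and "transcribe" are assumed example values.
body = {
    "job_parameters": build_job_parameters(
        "saaras:v3", "transcribe", input_audio_codec="pcm_s16le", is_raw_pcm=True
    )
}
```

The helper does not check the sample rate; ensuring that PCM audio is 16 kHz remains the caller's responsibility.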
Use the same field on speech-to-text translate bulk jobs when translate job parameters include input_audio_codec.
To switch between modes, simply change the mode parameter in your job creation call. The rest of the workflow (upload, start, wait, download) remains the same.
Transcribe audio in the original language.
If you call Batch - Download Results (POST /speech-to-text/job/v1/download-files) directly, the body must include both job_id and a files array (for example ["0.json"]). Output filenames come from the Batch - Get Status response (job_details[].outputs[].file_name). The SDK download_outputs() method supplies these automatically.
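For callers that skip the SDK, the request body can be derived from the status response. The sketch below assumes the status payload has already been parsed into a dict. The function name build_download_request and the sample payload are illustrative, while job_id, files, and job_details[].outputs[].file_name are the names given in the text.

```python
def build_download_request(job_id: str, status: dict) -> dict:
    """Collect every output file name from a status response into a download body."""
    files = [
        output["file_name"]
        for detail in status.get("job_details") or []
        # A file without outputs yet is treated as having none.
        for output in detail.get("outputs") or []
    ]
    if not files:
        raise ValueError("status response lists no output files yet")
    return {"job_id": job_id, "files": files}


# Illustrative status payload; real responses carry more fields.
sample_status = {
    "job_state": "Completed",
    "job_details": [
        {"outputs": [{"file_name": "0.json"}]},
        {"outputs": [{"file_name": "1.json"}]},
    ],
}
request_body = build_download_request("example-job-id", sample_status)
```

The resulting dict is what would be sent as the JSON body of the download-files call.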
Once you’ve created a job with your chosen mode, the upload, processing, and download workflow is the same for all modes:
Speaker diarization automatically identifies and separates different speakers in an audio recording. This feature is ideal for meetings, interviews, and multi-speaker conversations where you need to know who said what.
When with_diarization=True is passed in the request, the response includes a diarized_transcript field with detailed speaker information:
Each entry contains:
transcript: The text spoken by the speaker
start_time_seconds: When the speaker started speaking (float)
end_time_seconds: When the speaker stopped speaking (float)
speaker_id: Unique identifier for the speaker (e.g., “0”, “1”)

The SarvamAI SDK supports both synchronous and asynchronous programming in Python. This refers to how your code interacts with the SDK, not how the server handles the processing of requests.
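Returning to the diarized output fields listed above, one way to use them is to total the speaking time per speaker. This is a sketch under one assumption: that the entries sit in a list under diarized_transcript["entries"]. The text does not spell out that nesting, so adjust the lookup to match the JSON you actually download. The function name and sample data are illustrative.

```python
from collections import defaultdict


def speaking_time_by_speaker(result: dict) -> dict:
    """Sum (end_time_seconds - start_time_seconds) for each speaker_id."""
    # Assumption: entries are nested under diarized_transcript["entries"].
    diarized = result.get("diarized_transcript") or {}
    totals = defaultdict(float)
    for entry in diarized.get("entries", []):
        duration = entry["end_time_seconds"] - entry["start_time_seconds"]
        totals[entry["speaker_id"]] += duration
    return dict(totals)


# Illustrative result with two speakers.
sample_result = {
    "transcript": "Hello. Hi there. How are you?",
    "diarized_transcript": {
        "entries": [
            {"transcript": "Hello.", "start_time_seconds": 0.0,
             "end_time_seconds": 2.5, "speaker_id": "0"},
            {"transcript": "Hi there.", "start_time_seconds": 2.5,
             "end_time_seconds": 4.0, "speaker_id": "1"},
            {"transcript": "How are you?", "start_time_seconds": 4.0,
             "end_time_seconds": 6.0, "speaker_id": "0"},
        ]
    },
}
totals = speaking_time_by_speaker(sample_result)
```

A result without diarization yields an empty dict, so the same function can be applied to every downloaded file.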
For long-running batch jobs, you can use webhooks to receive notifications when jobs complete instead of polling for status updates.
When creating a job, include a callback parameter with your webhook URL and authentication token:
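A minimal sketch of assembling that callback value is shown here. The key name url and the helper build_callback are assumptions, since only auth_token is named in this page. Rejecting non-HTTPS URLs is a design choice for this example, in line with the HTTPS recommendation for production webhooks.

```python
from urllib.parse import urlparse


def build_callback(url: str, auth_token: str) -> dict:
    """Build a callback object (hypothetical helper; the "url" key is an assumption)."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("callback URL should be an absolute HTTPS URL")
    if not auth_token:
        raise ValueError("auth_token must not be empty")
    return {"url": url, "auth_token": auth_token}


# example.com and the token value are placeholders.
callback = build_callback("https://example.com/sarvam/webhook", "my-secret-token")
```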
When a job finishes (or fails), Sarvam AI sends a POST to your callback URL. The body matches the same JobStatusResponse schema returned by GET /speech-to-text/job/v1/{job_id}/status — it is a status notification, not the transcript text itself.
Sarvam includes your auth_token in the X-SARVAM-JOB-CALLBACK-TOKEN request header (validate this on every request).
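Token validation can be kept independent of the web framework. The sketch below assumes the incoming request headers are available as a plain dict of strings. The function name is illustrative, and the constant-time comparison via hmac.compare_digest is a precaution chosen for this example rather than a documented requirement.

```python
import hmac

CALLBACK_HEADER = "X-SARVAM-JOB-CALLBACK-TOKEN"


def is_authentic_callback(headers: dict, expected_token: str) -> bool:
    """Return True when the callback header carries the token set at job creation."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    received = normalised.get(CALLBACK_HEADER.lower(), "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(received.encode(), expected_token.encode())


ok = is_authentic_callback(
    {"x-sarvam-job-callback-token": "my-secret-token"}, "my-secret-token"
)
```

A handler would typically reject the request (for example with a 401 response) when this returns False, before reading the body.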
Transcripts are not in the webhook. Each successful file’s transcript lives in a downloaded JSON (for example 0.json listed under job_details[].outputs). After job_state is Completed, call POST /speech-to-text/job/v1/download-files (or use job.download_outputs() in the SDK) and parse those files. See Speaker Diarization → Output Format for the JSON shape (transcript, language_code, optional diarized_transcript, etc.).
Here’s a simple FastAPI server to handle webhook callbacks:
Your webhook server must respond with a 200 status code within 30 seconds. Make sure your webhook URL is publicly accessible and uses HTTPS in production.
Need help choosing the right API? Contact us on Discord for guidance.