Batch Speech-to-Text API

Process long audio files (up to 2 hours) using synchronous or asynchronous methods. Ideal for meetings, interviews, call center recordings, and large-scale content processing pipelines.

Supports files up to 2 hours long
Advanced transcription and translation
Speaker diarization and timestamp support

Note: You can upload up to 20 audio files per job.

Model Availability: The Batch API supports Saaras v3 with multiple output modes via the mode parameter.

Supported Modes (Saaras v3)

| Mode | Description | Output |
| --- | --- | --- |
| `transcribe` | Standard transcription in the original language | Text in source language |
| `translate` | Transcribe and translate to English | English text |
| `verbatim` | Word-for-word transcription including filler words and repetitions | Verbatim text in source language |
| `translit` | Transcribe and transliterate to Roman script | Romanized text |
| `codemix` | Transcribe code-mixed speech (e.g., Hindi-English) naturally | Code-mixed text |

Features

Processing

Supports audio files up to 2 hours long
Job-based API with synchronous and asynchronous SDK clients
Upload multiple files per job (up to 20)

Audio & Language Support

Indian languages and English
Automatic language detection
Diarization and timestamp support

Timestamps

Chunk-level timestamps only (not word-level)
Each chunk covers a sentence or phrase segment
Provides start and end times for each chunk of text
Useful for subtitle alignment and audio navigation
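
As an illustration of subtitle alignment, the sketch below turns chunk-level timestamps into SRT cues. It assumes the `timestamps` object has the shape shown in the sample output under Speaker Diarization → Output Format (parallel `chunks`, `start_time_seconds`, and `end_time_seconds` lists). The helper names `format_srt_time` and `chunks_to_srt` are hypothetical and not part of the SDK.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(timestamps: dict) -> str:
    """Build SRT text from the parallel chunk/start/end lists."""
    rows = zip(
        timestamps["chunks"],
        timestamps["start_time_seconds"],
        timestamps["end_time_seconds"],
    )
    cues = []
    for index, (text, start, end) in enumerate(rows, start=1):
        cues.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)


# Sample data copied from the output format shown later in this page.
sample = {
    "chunks": ["Hello, how can I help you today?", "I have a question."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 4.2],
}
print(chunks_to_srt(sample))
```

Because each chunk covers a sentence or phrase, every cue maps to one chunk; per-word highlighting is not possible with this data.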

Speaker Diarization

Identify multiple speakers
Output includes a speaker_id for each segment (for example "0", "1")
Ideal for meetings and interviews

Note: Word-level timestamps are not supported. The Batch API returns chunk-level timestamps only: each timestamp entry covers a sentence or phrase, not an individual word. Per-word timing is not currently available.

Input audio codec (batch)

When you call init job (POST /speech-to-text/job/v1), set input_audio_codec inside job_parameters when needed. Behavior matches the REST upload APIs: most formats are auto-detected, but for raw PCM (pcm_s16le, pcm_l16, pcm_raw) you must set this field. Raw PCM audio must use a 16 kHz sample rate. See supported formats and MIME types for the full list.

The same field applies to speech-to-text translate bulk jobs whenever their job parameters include input_audio_codec.

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# input_audio_codec is set via the low-level init-job API
job = client.speech_to_text_job.initialise(
    job_parameters={
        "model": "saaras:v3",
        "mode": "transcribe",
        "language_code": "hi-IN",
        "input_audio_codec": "pcm_s16le",  # required for raw PCM uploads in the batch
    },
)
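
The codec rules above can be checked before a job is created. The following is a minimal sketch, not SDK code: `build_job_parameters` is a hypothetical helper, and it only encodes what this section states (the three raw PCM codec names and the 16 kHz requirement). Other codec values are passed through unvalidated because the full list of supported formats is not reproduced here.

```python
from typing import Optional

# Raw PCM codec names listed in this section; these must be declared explicitly.
RAW_PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}
REQUIRED_PCM_SAMPLE_RATE_HZ = 16_000


def build_job_parameters(
    base: dict,
    codec: Optional[str] = None,
    sample_rate_hz: Optional[int] = None,
) -> dict:
    """Return a copy of `base` with input_audio_codec added when one is given."""
    params = dict(base)
    if codec is None:
        # No codec declared: rely on server-side auto-detection.
        return params
    if codec in RAW_PCM_CODECS and sample_rate_hz != REQUIRED_PCM_SAMPLE_RATE_HZ:
        raise ValueError(
            f"{codec} audio must be sampled at {REQUIRED_PCM_SAMPLE_RATE_HZ} Hz, "
            f"got {sample_rate_hz!r}"
        )
    params["input_audio_codec"] = codec
    return params


params = build_job_parameters(
    {"model": "saaras:v3", "mode": "transcribe", "language_code": "hi-IN"},
    codec="pcm_s16le",
    sample_rate_hz=16_000,
)
print(params)
```

The resulting dictionary can be passed as `job_parameters` to the init-job call shown above.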

Code Examples

Choosing a Mode

To switch between modes, simply change the mode parameter in your job creation call. The rest of the workflow (upload, start, wait, download) remains the same.

Transcribe

Transcribe audio in the original language.

job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="transcribe",         # Standard transcription
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
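
Since only the mode value changes between workflows, it can help to validate it before creating a job. The sketch below is illustrative: `job_kwargs` is a hypothetical helper, the mode list is copied from the Supported Modes table, and the speaker limit of 8 comes from the Speaker Diarization section. It builds only the keyword arguments already used in the call above.

```python
from typing import Optional

# Modes and their outputs, copied from the Supported Modes table.
SUPPORTED_MODES = {
    "transcribe": "Text in source language",
    "translate": "English text",
    "verbatim": "Verbatim text in source language",
    "translit": "Romanized text",
    "codemix": "Code-mixed text",
}
MAX_SPEAKERS = 8  # limit stated under Speaker Diarization


def job_kwargs(mode: str, language_code: str, num_speakers: Optional[int] = None) -> dict:
    """Build keyword arguments for create_job, rejecting unknown modes early."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(SUPPORTED_MODES)}")
    kwargs = {"model": "saaras:v3", "mode": mode, "language_code": language_code}
    if num_speakers is not None:
        if not 1 <= num_speakers <= MAX_SPEAKERS:
            raise ValueError(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
        kwargs["with_diarization"] = True
        kwargs["num_speakers"] = num_speakers
    return kwargs


print(job_kwargs("translate", "hi-IN", num_speakers=2))
```

With a client in scope, the result could be unpacked into the call, for example `client.speech_to_text_job.create_job(**job_kwargs("translate", "hi-IN"))`.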

REST download note

If you call Batch - Download Results (POST /speech-to-text/job/v1/download-files) directly, the body must include both job_id and a files array (for example ["0.json"]). Output filenames come from the Batch - Get Status response (job_details[].outputs[].file_name). The SDK download_outputs() method supplies these automatically.
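
For direct REST callers, the request body can be derived from the status response. This is a minimal sketch that only builds the body; `build_download_body` is a hypothetical helper, and the sample status dictionary mirrors the field names mentioned above (`job_details[].outputs[].file_name`). Sending the request (base URL, headers) is left out because those details are not covered in this section.

```python
def build_download_body(job_id: str, status: dict) -> dict:
    """Collect every output file name from a Get Status response."""
    files = [
        output["file_name"]
        for detail in status.get("job_details") or []
        for output in detail.get("outputs") or []
    ]
    return {"job_id": job_id, "files": files}


# Trimmed-down status response using the documented field names.
status = {
    "job_id": "job_12345",
    "job_details": [
        {"outputs": [{"file_name": "0.json", "file_id": "output-0"}]},
        {"outputs": [{"file_name": "1.json", "file_id": "output-1"}]},
    ],
}
body = build_download_body(status["job_id"], status)
print(body)
```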

Full Example

Once you’ve created a job with your chosen mode, the upload, processing, and download workflow is the same for all modes:

from sarvamai import SarvamAI

def main():
    client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Create batch job (change mode as needed)
    job = client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        language_code="en-IN",
        with_diarization=True,
        num_speakers=2
    )

    # Upload and process files
    audio_paths = ["path/to/audio1.mp3", "path/to/audio2.mp3"]
    job.upload_files(file_paths=audio_paths)
    job.start()

    # Wait for completion
    job.wait_until_complete()

    # Check file-level results
    file_results = job.get_file_results()

    print(f"\nSuccessful: {len(file_results['successful'])}")
    for f in file_results['successful']:
        print(f"  ✓ {f['file_name']}")

    print(f"\nFailed: {len(file_results['failed'])}")
    for f in file_results['failed']:
        print(f"  ✗ {f['file_name']}: {f['error_message']}")

    # Download outputs for successful files
    if file_results['successful']:
        job.download_outputs(output_dir="./output")
        print(f"\nDownloaded {len(file_results['successful'])} file(s) to: ./output")

if __name__ == "__main__":
    main()

Speaker Diarization

Speaker diarization automatically identifies and separates different speakers in an audio recording. This feature is ideal for meetings, interviews, and multi-speaker conversations where you need to know who said what.

Capabilities

Identify multiple speakers in a single audio file
Assign a unique speaker_id to each speaker (for example "0", "1")
Provide timestamps for each speaker segment
Works with up to 8 speakers per audio file

Output Format

When with_diarization=True is passed in the request, the response includes a diarized_transcript field with detailed speaker information:

{
  "request_id": "20260130_d8d2c0e6-1eb6-4982-8045-b267d5165c44",
  "transcript": "Full transcript text...",
  "timestamps": {
    "chunks": ["Hello, how can I help you today?", "I have a question."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 4.2]
  },
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "Hello, how can I help you today?",
        "start_time_seconds": 0.01,
        "end_time_seconds": 2.5,
        "speaker_id": "0"
      },
      {
        "transcript": "I have a question.",
        "start_time_seconds": 2.8,
        "end_time_seconds": 4.2,
        "speaker_id": "1"
      }
    ]
  },
  "language_code": "en-IN"
}

Each entry contains:

transcript: The text spoken by the speaker
start_time_seconds: When the speaker started speaking (float)
end_time_seconds: When the speaker stopped speaking (float)
speaker_id: Unique identifier for the speaker (e.g., "0", "1")
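
To show how these entries can be consumed, here is a small sketch that totals speaking time per speaker and renders a readable dialogue. The helper names are hypothetical; the code relies only on the four entry fields listed above.

```python
from collections import defaultdict


def speaking_time_by_speaker(entries: list) -> dict:
    """Sum the duration of every segment, keyed by speaker_id."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry["speaker_id"]] += entry["end_time_seconds"] - entry["start_time_seconds"]
    return dict(totals)


def format_dialogue(entries: list) -> str:
    """Render one line per segment, prefixed with the speaker_id."""
    return "\n".join(f"Speaker {e['speaker_id']}: {e['transcript']}" for e in entries)


# Entries copied from the sample response above.
entries = [
    {"transcript": "Hello, how can I help you today?", "start_time_seconds": 0.01,
     "end_time_seconds": 2.5, "speaker_id": "0"},
    {"transcript": "I have a question.", "start_time_seconds": 2.8,
     "end_time_seconds": 4.2, "speaker_id": "1"},
]
print(format_dialogue(entries))
print(speaking_time_by_speaker(entries))
```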

The SarvamAI SDK supports both synchronous and asynchronous programming in Python. This refers to how your code interacts with the SDK, not how the server handles the processing of requests.

Polling Defaults

job.wait_until_complete() polls GET /speech-to-text/job/v1/{job_id}/status in a loop until the job reaches a terminal state:

def wait_until_complete(self, poll_interval: int = 5, timeout: int = 600) -> JobStatusResponse

| Parameter | Default | Meaning |
| --- | --- | --- |
| `poll_interval` | 5 seconds | Time between status checks |
| `timeout` | 600 seconds (10 minutes) | Raises `TimeoutError` if the job hasn’t reached `Completed`/`Failed` by then |

Both can be overridden, for example job.wait_until_complete(poll_interval=10, timeout=1800), when your files are large enough that the default 10 minutes may not be sufficient.
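
To make the interaction between the two parameters concrete, the sketch below shows a generic polling loop with the same defaults. It is not the SDK's implementation: `poll_until_terminal` and the `get_state` callable are hypothetical stand-ins for whatever fetches the job state, and the terminal states are the two named in the table above.

```python
import time

TERMINAL_STATES = {"Completed", "Failed"}


def poll_until_terminal(get_state, poll_interval: float = 5.0, timeout: float = 600.0) -> str:
    """Call get_state() repeatedly until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {state!r} after {timeout} seconds")
        time.sleep(poll_interval)


# Simulated status sequence; a real caller would query the status endpoint instead.
states = iter(["Accepted", "Pending", "Running", "Completed"])
print(poll_until_terminal(lambda: next(states), poll_interval=0))
```

A longer `poll_interval` reduces the number of status requests, while `timeout` bounds the total wait regardless of how many checks were made.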

Document Digitization’s wait_until_complete() uses different defaults (2-second poll interval, no timeout by default). See Document Digitization → Polling Defaults.

Webhook Support

For long-running batch jobs, you can use webhooks to receive notifications when jobs complete instead of polling for status updates.

Setting Up Webhooks

When creating a job, include a callback parameter with your webhook URL and authentication token:

import asyncio

from sarvamai import AsyncSarvamAI, BulkJobCallbackParams


async def main():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Register the webhook URL and token when the job is created
    job = await client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        with_diarization=True,
        callback=BulkJobCallbackParams(
            url="https://your-server.com/webhook-endpoint",
            auth_token="your-secret-token",
        ),
    )


if __name__ == "__main__":
    asyncio.run(main())

Webhook Payload

When a job finishes (or fails), Sarvam AI sends a POST request to your callback URL. The body uses the same JobStatusResponse schema returned by GET /speech-to-text/job/v1/{job_id}/status. It is a status notification, not the transcript text itself.

Sarvam includes your auth_token in the X-SARVAM-JOB-CALLBACK-TOKEN request header (validate this on every request).

{
  "job_id": "job_12345",
  "job_state": "Completed",
  "created_at": "2026-05-27T10:30:00Z",
  "updated_at": "2026-05-27T10:35:12Z",
  "total_files": 2,
  "successful_files_count": 2,
  "failed_files_count": 0,
  "storage_container_type": "Azure",
  "error_message": "",
  "job_details": [
    {
      "inputs": [{ "file_name": "meeting.mp3", "file_id": "input-0" }],
      "outputs": [{ "file_name": "0.json", "file_id": "output-0" }],
      "state": "Success",
      "error_message": null,
      "exception_name": null
    }
  ]
}

| Field | Meaning |
| --- | --- |
| `job_state` | One of `Accepted`, `Pending`, `Running`, `Completed`, or `Failed` (title case) |
| `job_details` | Per-file status: input/output file names, processing `state`, and errors |
| `successful_files_count` / `failed_files_count` | How many files succeeded vs failed |
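
As a rough illustration of reading these fields, the sketch below splits the per-file entries into succeeded and failed inputs. `summarize_job` is a hypothetical helper. It assumes that any per-file `state` other than `Success` means the file did not succeed; only `Success` appears in the sample payload, so treat that rule as an assumption.

```python
def summarize_job(payload: dict) -> dict:
    """Group input file names by whether their job_details entry succeeded."""
    succeeded, failed = [], []
    for detail in payload.get("job_details") or []:
        names = [item["file_name"] for item in detail.get("inputs") or []]
        if detail.get("state") == "Success":
            succeeded.extend(names)
        else:
            # Assumption: every non-"Success" state is reported as a failure here.
            failed.extend((name, detail.get("error_message")) for name in names)
    return {
        "job_state": payload.get("job_state"),
        "succeeded": succeeded,
        "failed": failed,
    }


payload = {
    "job_id": "job_12345",
    "job_state": "Completed",
    "job_details": [
        {
            "inputs": [{"file_name": "meeting.mp3", "file_id": "input-0"}],
            "outputs": [{"file_name": "0.json", "file_id": "output-0"}],
            "state": "Success",
            "error_message": None,
        }
    ],
}
summary = summarize_job(payload)
print(summary)
```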

Transcripts are not in the webhook. Each successful file’s transcript lives in a downloaded JSON (for example 0.json listed under job_details[].outputs). After job_state is Completed, call POST /speech-to-text/job/v1/download-files (or use job.download_outputs() in the SDK) and parse those files. See Speaker Diarization → Output Format for the JSON shape (transcript, language_code, optional diarized_transcript, etc.).
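
Once the outputs are downloaded, parsing them is plain JSON handling. The sketch below assumes the output directory contains one `.json` file per successful input, each following the shape under Speaker Diarization → Output Format. `load_transcripts` is a hypothetical helper, and the demo writes its own sample file to a temporary directory instead of calling the API.

```python
import json
import tempfile
from pathlib import Path


def load_transcripts(output_dir: str) -> dict:
    """Read every *.json file in output_dir and extract the main fields."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # diarized_transcript is optional, so fall back to an empty entry list.
        entries = (data.get("diarized_transcript") or {}).get("entries", [])
        results[path.name] = {
            "transcript": data.get("transcript", ""),
            "language_code": data.get("language_code"),
            "speakers": sorted({entry["speaker_id"] for entry in entries}),
        }
    return results


# Demo with a locally written sample file standing in for a downloaded output.
with tempfile.TemporaryDirectory() as tmp:
    sample_output = {
        "transcript": "I have a question.",
        "language_code": "en-IN",
        "diarized_transcript": {"entries": [{"transcript": "I have a question.",
                                             "start_time_seconds": 2.8,
                                             "end_time_seconds": 4.2,
                                             "speaker_id": "1"}]},
    }
    Path(tmp, "0.json").write_text(json.dumps(sample_output), encoding="utf-8")
    loaded = load_transcripts(tmp)

print(loaded["0.json"]["transcript"])
```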

Webhook Server Example

Here’s a simple FastAPI server to handle webhook callbacks:

from fastapi import FastAPI, Request, HTTPException
import uvicorn

app = FastAPI()
VALID_TOKEN = "your-secret-token"

@app.post("/webhook-endpoint")
async def handle_webhook(request: Request):
    # Validate authentication
    token = request.headers.get("X-SARVAM-JOB-CALLBACK-TOKEN")
    if token != VALID_TOKEN:
        raise HTTPException(status_code=403, detail="Invalid token")

    # Process the webhook data
    data = await request.json()
    job_id = data.get("job_id")
    job_state = data.get("job_state")

    if job_state == "Completed":
        print(f"Job {job_id} completed successfully!")
        # Fetch transcripts: download output JSON files (see job_details[].outputs)
    elif job_state == "Failed":
        print(f"Job {job_id} failed: {data.get('error_message')}")
        # Handle failure

    return {"status": "success"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Your webhook server must respond with a 200 status code within 30 seconds. Make sure your webhook URL is publicly accessible and uses HTTPS in production.
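
One way to stay within the 30-second limit is to acknowledge the callback immediately and do slow work (downloading and parsing outputs) elsewhere. The sketch below is framework-independent and uses only the standard library. `handle_callback`, `token_is_valid`, and the worker are hypothetical names, and the plain `headers` dictionary is matched case-sensitively, unlike most web frameworks. It also compares the token with `hmac.compare_digest`, a constant-time comparison, as an optional hardening step.

```python
import hmac
import queue
import threading

EXPECTED_TOKEN = "your-secret-token"
pending = queue.Queue()  # payloads waiting for slow processing
processed = []           # job IDs handled by the worker (for the demo)


def token_is_valid(received) -> bool:
    """Compare the received token with the expected one in constant time."""
    if received is None:
        return False
    return hmac.compare_digest(received.encode("utf-8"), EXPECTED_TOKEN.encode("utf-8"))


def handle_callback(headers: dict, payload: dict) -> int:
    """Return an HTTP status code quickly; defer real work to the worker."""
    if not token_is_valid(headers.get("X-SARVAM-JOB-CALLBACK-TOKEN")):
        return 403
    pending.put(payload)
    return 200


def worker() -> None:
    while True:
        payload = pending.get()
        if payload is None:  # sentinel: stop the worker
            break
        # Slow work such as downloading output files would go here.
        processed.append(payload["job_id"])


thread = threading.Thread(target=worker, daemon=True)
thread.start()
status = handle_callback(
    {"X-SARVAM-JOB-CALLBACK-TOKEN": "your-secret-token"},
    {"job_id": "job_12345", "job_state": "Completed"},
)
pending.put(None)
thread.join()
print(status, processed)
```

In a real deployment the queue would typically be replaced by a task queue or the web framework's background task mechanism; the point is only that the response does not wait for the download.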

Next Steps

Choose Your API

Select the appropriate API type based on your use case.

Get API Key

Go Live

Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.