
# Batch Speech-to-Text API

> Process large audio files using synchronous or asynchronous methods. Handle up to 2-hour recordings with speaker diarization, timestamps, and advanced transcription features.

<p>
  Process long audio files (up to 2 hours) using synchronous or asynchronous
  methods. Ideal for meetings, interviews, call center recordings, and
  large-scale content processing pipelines.
</p>

<ul>
  <li>
    Supports files up to 2 hours long
  </li>

  <li>
    Advanced transcription and translation
  </li>

  <li>
    Speaker diarization and timestamp support
  </li>
</ul>

<p>
  <strong>Note:</strong> You can upload up to <code>20</code> audio files per
  job.
</p>

**Model Availability:** The Batch API supports **Saaras v3** (recommended) with multiple output modes via the `mode` parameter. The legacy models **Saarika v2.5** and **Saaras v2.5** are also available, but we recommend switching to **Saaras v3** for the best accuracy and features.

### Supported Modes (Saaras v3)

| Mode         | Description                                                        | Output                           |
| ------------ | ------------------------------------------------------------------ | -------------------------------- |
| `transcribe` | Standard transcription in the original language                    | Text in source language          |
| `translate`  | Transcribe and translate to English                                | English text                     |
| `verbatim`   | Word-for-word transcription including filler words and repetitions | Verbatim text in source language |
| `translit`   | Transcribe and transliterate to Roman script                       | Romanized text                   |
| `codemix`    | Transcribe code-mixed speech (e.g., Hindi-English) naturally       | Code-mixed text                  |

## Features

* Supports audio up to 2 hours long
* Synchronous and asynchronous job-based API
* Upload multiple files per job

{" "}

* Indian languages and English
* Automatic language detection
* Diarization and timestamp support

* Chunk-level timestamp support
* Useful for subtitle alignment and audio navigation
* Provides start and end times for each segment of text

* Identify multiple speakers
* Output includes speaker labels (SPEAKER\_00, etc.)
* Ideal for meetings and interviews

## Input audio codec (batch)

When you call **init job** (`POST /speech-to-text/job/v1`), set `input_audio_codec` inside `job_parameters` when needed. Behavior matches the REST upload APIs: most formats are auto-detected; for **raw PCM** (`pcm_s16le`, `pcm_l16`, `pcm_raw`) you **must** set this field. PCM must be **16 kHz** sample rate. See [supported formats and MIME types](/api-reference-docs/speech-to-text/apis/overview) for the full list.

The same field applies to **speech-to-text translate** bulk jobs, provided their job parameters include `input_audio_codec`.

```python
# input_audio_codec is set via the low-level init-job API
job = client.speech_to_text_job.initialise(
    job_parameters={
        "model": "saaras:v3",
        "mode": "transcribe",
        "language_code": "hi-IN",
        "input_audio_codec": "pcm_s16le",  # required for raw PCM uploads in the batch
    },
)
```

```javascript
// input_audio_codec is set via the low-level init-job API
const job = await client.speechToTextJob.initialise({
    job_parameters: {
        model: "saaras:v3",
        mode: "transcribe",
        language_code: "hi-IN",
        input_audio_codec: "pcm_s16le",
    },
});
```
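Raw PCM has no header, so its format cannot be auto-detected. The sketch below uses only Python's standard `wave` module to pull headerless samples out of a WAV file and to check the 16 kHz requirement before upload. The helper name `wav_to_raw_pcm` is illustrative and not part of the SDK, and the 16-bit check is an assumption inferred from the `pcm_s16le` codec name.

```python
import wave

REQUIRED_SAMPLE_RATE = 16_000  # batch PCM input must be 16 kHz
REQUIRED_SAMPLE_WIDTH = 2      # bytes per sample; assumed from the "s16le" codec name


def wav_to_raw_pcm(wav_path: str, pcm_path: str) -> int:
    """Write the WAV file's samples as headerless PCM and return the bytes written."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != REQUIRED_SAMPLE_RATE:
            raise ValueError(f"expected 16 kHz audio, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != REQUIRED_SAMPLE_WIDTH:
            raise ValueError("expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())

    # WAV stores 16-bit samples little-endian, so the frames are already s16le.
    with open(pcm_path, "wb") as out:
        out.write(frames)
    return len(frames)
```

Files that fail either check need to be resampled or converted with an audio tool before they are uploaded as raw PCM.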

## Code Examples

### Choosing a Mode

To switch between modes, simply change the `mode` parameter in your job creation call. The rest of the workflow (upload, start, wait, download) remains the same.

Transcribe audio in the original language.

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="transcribe",         # Standard transcription
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
```

```javascript
const job = await client.speechToTextJob.createJob({
    model: "saaras:v3",
    mode: "transcribe",         // Standard transcription
    languageCode: "hi-IN",
    withDiarization: true,
    numSpeakers: 2,
});
```

Transcribe and translate audio to English.

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="translate",          # Translate to English
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
```

```javascript
const job = await client.speechToTextJob.createJob({
    model: "saaras:v3",
    mode: "translate",          // Translate to English
    languageCode: "hi-IN",
    withDiarization: true,
    numSpeakers: 2,
});
```

Word-for-word transcription including filler words and repetitions.

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="verbatim",           # Include fillers & repetitions
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
```

```javascript
const job = await client.speechToTextJob.createJob({
    model: "saaras:v3",
    mode: "verbatim",           // Include fillers & repetitions
    languageCode: "hi-IN",
    withDiarization: true,
    numSpeakers: 2,
});
```

Transcribe and transliterate to Roman script.

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="translit",           # Romanized output
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
```

```javascript
const job = await client.speechToTextJob.createJob({
    model: "saaras:v3",
    mode: "translit",           // Romanized output
    languageCode: "hi-IN",
    withDiarization: true,
    numSpeakers: 2,
});
```

Transcribe code-mixed speech (e.g., Hindi-English) naturally.

```python
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="codemix",            # Handle mixed-language speech
    language_code="hi-IN",
    with_diarization=True,
    num_speakers=2,
)
```

```javascript
const job = await client.speechToTextJob.createJob({
    model: "saaras:v3",
    mode: "codemix",            // Handle mixed-language speech
    languageCode: "hi-IN",
    withDiarization: true,
    numSpeakers: 2,
});
```

### REST download note

If you call **Batch - Download Results** (`POST /speech-to-text/job/v1/download-files`) directly, the body must include both `job_id` and a `files` array (for example `["0.json"]`). Output filenames come from the [Batch - Get Status](/api-reference-docs/speech-to-text/apis/batch) response (`job_details[].outputs[].file_name`). The SDK `download_outputs()` method supplies these automatically.
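A minimal sketch of assembling that request body from a status response. It only builds the payload and makes no HTTP call. The helper name `build_download_request` is illustrative, and the `status` dictionary is a trimmed, hand-written stand-in for a real Get Status response.

```python
def build_download_request(job_id: str, status: dict) -> dict:
    """Collect every output file name from a status response into a download body."""
    files = [
        output["file_name"]
        for detail in status.get("job_details", [])
        for output in detail.get("outputs") or []
    ]
    return {"job_id": job_id, "files": files}


# Trimmed stand-in for a Get Status response (illustrative values).
status = {
    "job_details": [
        {"outputs": [{"file_name": "0.json"}]},
        {"outputs": [{"file_name": "1.json"}]},
    ]
}

print(build_download_request("job_12345", status))
# {'job_id': 'job_12345', 'files': ['0.json', '1.json']}
```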

### Full Example

Once you've created a job with your chosen mode, the upload, processing, and download workflow is the same for all modes:

```python
from sarvamai import SarvamAI

def main():
    client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Create batch job — change mode as needed
    job = client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        language_code="en-IN",
        with_diarization=True,
        num_speakers=2
    )

    # Upload and process files
    audio_paths = ["path/to/audio1.mp3", "path/to/audio2.mp3"]
    job.upload_files(file_paths=audio_paths)
    job.start()

    # Wait for completion
    job.wait_until_complete()

    # Check file-level results
    file_results = job.get_file_results()

    print(f"\nSuccessful: {len(file_results['successful'])}")
    for f in file_results['successful']:
        print(f"  ✓ {f['file_name']}")

    print(f"\nFailed: {len(file_results['failed'])}")
    for f in file_results['failed']:
        print(f"  ✗ {f['file_name']}: {f['error_message']}")

    # Download outputs for successful files
    if file_results['successful']:
        job.download_outputs(output_dir="./output")
        print(f"\nDownloaded {len(file_results['successful'])} file(s) to: ./output")

if __name__ == "__main__":
    main()
```

```python
import asyncio
from sarvamai import AsyncSarvamAI

async def main():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Create batch job — change mode as needed
    job = await client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        language_code="en-IN",
        with_diarization=True,
        num_speakers=2
    )

    # Upload and process files
    audio_paths = ["path/to/audio1.mp3", "path/to/audio2.mp3"]
    await job.upload_files(file_paths=audio_paths)
    await job.start()

    # Wait for completion
    await job.wait_until_complete()

    # Check file-level results
    file_results = await job.get_file_results()

    print(f"\nSuccessful: {len(file_results['successful'])}")
    for f in file_results['successful']:
        print(f"  ✓ {f['file_name']}")

    print(f"\nFailed: {len(file_results['failed'])}")
    for f in file_results['failed']:
        print(f"  ✗ {f['file_name']}: {f['error_message']}")

    # Download outputs for successful files
    if file_results['successful']:
        await job.download_outputs(output_dir="./output")
        print(f"\nDownloaded {len(file_results['successful'])} file(s) to: ./output")

if __name__ == "__main__":
    asyncio.run(main())
```

```javascript
import { SarvamAIClient } from "sarvamai";

async function main() {
    const client = new SarvamAIClient({
        apiSubscriptionKey: "YOUR_SARVAM_API_KEY"
    });

    // Create batch job — change mode as needed
    const job = await client.speechToTextJob.createJob({
        model: "saaras:v3",
        mode: "transcribe",
        languageCode: "en-IN",
        withDiarization: true,
        numSpeakers: 2
    });

    // Upload and process files
    const audioPaths = ["path/to/audio1.mp3", "path/to/audio2.mp3"];
    await job.uploadFiles(audioPaths);
    await job.start();

    // Wait for completion
    await job.waitUntilComplete();

    // Check file-level results
    const fileResults = await job.getFileResults();

    console.log(`\nSuccessful: ${fileResults.successful.length}`);
    for (const f of fileResults.successful) {
        console.log(`  ✓ ${f.file_name}`);
    }

    console.log(`\nFailed: ${fileResults.failed.length}`);
    for (const f of fileResults.failed) {
        console.log(`  ✗ ${f.file_name}: ${f.error_message}`);
    }

    // Download outputs for successful files
    if (fileResults.successful.length > 0) {
        await job.downloadOutputs("./output");
        console.log(`\nDownloaded ${fileResults.successful.length} file(s) to: ./output`);
    }
}

main().catch(console.error);
```
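A job accepts at most 20 audio files, so a larger collection has to be split across several jobs. The sketch below shows one way to do the split. `chunk_files` and `MAX_FILES_PER_JOB` are illustrative names, not SDK members.

```python
from typing import Iterator, List, Sequence

MAX_FILES_PER_JOB = 20  # per-job upload limit


def chunk_files(paths: Sequence[str], size: int = MAX_FILES_PER_JOB) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` file paths."""
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


paths = [f"audio_{i}.mp3" for i in range(45)]
for batch in chunk_files(paths):
    print(len(batch))  # prints 20, 20, 5
```

Each batch can then go through the same `create_job`, `upload_files`, and `start` sequence shown in the full example.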

## Speaker Diarization

Speaker diarization automatically identifies and separates different speakers in an audio recording. This feature is ideal for meetings, interviews, and multi-speaker conversations where you need to know who said what.

### Capabilities

* Identify multiple speakers in a single audio file
* Assign unique speaker IDs (for example `"0"`, `"1"`)
* Provide timestamps for each speaker segment
* Works with up to 8 speakers per audio file

### Output Format

When `with_diarization=True` is passed in the request, the response includes a `diarized_transcript` field with detailed speaker information:

```json
{
  "request_id": "20260130_d8d2c0e6-1eb6-4982-8045-b267d5165c44",
  "transcript": "Full transcript text...",
  "timestamps": {
    "words": ["Hello, how can I help you today?", "I have a question."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 4.2]
  },
  "diarized_transcript": {
    "entries": [
      {
        "transcript": "Hello, how can I help you today?",
        "start_time_seconds": 0.01,
        "end_time_seconds": 2.5,
        "speaker_id": "0"
      },
      {
        "transcript": "I have a question.",
        "start_time_seconds": 2.8,
        "end_time_seconds": 4.2,
        "speaker_id": "1"
      }
    ]
  },
  "language_code": "en-IN"
}
```

Each entry contains:

* `transcript`: The text spoken by the speaker
* `start_time_seconds`: When the speaker started speaking (float)
* `end_time_seconds`: When the speaker stopped speaking (float)
* `speaker_id`: Unique identifier for the speaker (e.g., "0", "1")
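The sketch below shows one way to consume these entries once an output JSON has been loaded into a dictionary: it totals speaking time per speaker and renders a readable dialogue. The function names are illustrative, and the `result` value is a hand-written sample in the shape shown above.

```python
from collections import defaultdict


def speaking_time_by_speaker(result: dict) -> dict:
    """Sum segment durations (in seconds) for each speaker_id."""
    totals = defaultdict(float)
    entries = (result.get("diarized_transcript") or {}).get("entries", [])
    for entry in entries:
        totals[entry["speaker_id"]] += entry["end_time_seconds"] - entry["start_time_seconds"]
    return dict(totals)


def format_dialogue(result: dict) -> str:
    """Render one line per diarized segment, in the order returned."""
    entries = (result.get("diarized_transcript") or {}).get("entries", [])
    return "\n".join(
        f"[{e['start_time_seconds']:.2f}-{e['end_time_seconds']:.2f}] "
        f"Speaker {e['speaker_id']}: {e['transcript']}"
        for e in entries
    )


result = {
    "diarized_transcript": {
        "entries": [
            {"transcript": "Hello, how can I help you today?",
             "start_time_seconds": 0.01, "end_time_seconds": 2.5, "speaker_id": "0"},
            {"transcript": "I have a question.",
             "start_time_seconds": 2.8, "end_time_seconds": 4.2, "speaker_id": "1"},
        ]
    }
}

print(format_dialogue(result))
print(speaking_time_by_speaker(result))
```

Both helpers return empty results when `diarized_transcript` is absent, which is the case when diarization was not requested.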

The SarvamAI SDK supports both synchronous and asynchronous programming in
Python. This refers to how your code interacts with the SDK, not how the
server handles the processing of requests.
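The `timestamps` field in the output holds parallel arrays, which makes it easy to turn into subtitles. The sketch below converts it to SubRip (SRT) text, assuming each item in `words` is a chunk of text aligned with the start and end time at the same index, as in the sample output. The function names `srt_timestamp` and `to_srt` are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(timestamps: dict) -> str:
    """Build numbered SRT cues from the parallel timestamp arrays."""
    cues = zip(
        timestamps["words"],
        timestamps["start_time_seconds"],
        timestamps["end_time_seconds"],
    )
    blocks = [
        f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for index, (text, start, end) in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


timestamps = {
    "words": ["Hello, how can I help you today?", "I have a question."],
    "start_time_seconds": [0.01, 2.8],
    "end_time_seconds": [2.5, 4.2],
}
print(to_srt(timestamps))
```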

## Webhook Support

For long-running batch jobs, you can use webhooks to receive notifications when jobs complete instead of polling for status updates.

### Setting Up Webhooks

When creating a job, include a `callback` parameter with your webhook URL and authentication token:

```python
import asyncio

from sarvamai import AsyncSarvamAI, BulkJobCallbackParams


async def main():
    client = AsyncSarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Register the webhook when the job is created
    job = await client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        with_diarization=True,
        callback=BulkJobCallbackParams(
            url="https://your-server.com/webhook-endpoint",
            auth_token="your-secret-token"
        )
    )


if __name__ == "__main__":
    asyncio.run(main())
```

```python
from sarvamai import SarvamAI, BulkJobCallbackParams

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="transcribe",
    with_diarization=True,
    callback=BulkJobCallbackParams(
        url="https://your-server.com/webhook-endpoint",
        auth_token="your-secret-token"
    )
)
```

### Webhook Payload

When a job finishes (or fails), Sarvam AI sends a **POST** to your callback URL. The body matches the same `JobStatusResponse` schema returned by [`GET /speech-to-text/job/v1/{job_id}/status`](/api-reference-docs/speech-to-text/stt/job/status) — it is a **status notification**, not the transcript text itself.

Sarvam includes your `auth_token` in the **`X-SARVAM-JOB-CALLBACK-TOKEN`** request header (validate this on every request).

```json
{
  "job_id": "job_12345",
  "job_state": "Completed",
  "created_at": "2026-05-27T10:30:00Z",
  "updated_at": "2026-05-27T10:35:12Z",
  "total_files": 2,
  "successful_files_count": 2,
  "failed_files_count": 0,
  "storage_container_type": "Azure",
  "error_message": "",
  "job_details": [
    {
      "inputs": [{ "file_name": "meeting.mp3", "file_id": "input-0" }],
      "outputs": [{ "file_name": "0.json", "file_id": "output-0" }],
      "state": "Success",
      "error_message": null,
      "exception_name": null
    }
  ]
}
```

| Field                                           | Meaning                                                                        |
| ----------------------------------------------- | ------------------------------------------------------------------------------ |
| `job_state`                                     | One of `Accepted`, `Pending`, `Running`, `Completed`, or `Failed` (title case) |
| `job_details`                                   | Per-file status: input/output file names, processing `state`, and errors       |
| `successful_files_count` / `failed_files_count` | How many files succeeded vs failed                                             |

**Transcripts are not in the webhook.** Each successful file’s transcript lives in a **downloaded JSON** (for example `0.json` listed under `job_details[].outputs`). After `job_state` is `Completed`, call [`POST /speech-to-text/job/v1/download-files`](/api-reference-docs/speech-to-text/stt/job/download) (or use `job.download_outputs()` in the SDK) and parse those files. See [Speaker Diarization → Output Format](#speaker-diarization) for the JSON shape (`transcript`, `language_code`, optional `diarized_transcript`, etc.).
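A minimal sketch of reading a callback payload to decide what to download and what to report. The helper name `summarize_callback` is illustrative. Treating every per-file `state` other than `Success` as a failure is an assumption, since only `Success` appears in the sample payload.

```python
def summarize_callback(payload: dict) -> dict:
    """Split a callback payload into downloadable outputs and failed inputs."""
    outputs, failures = [], []
    for detail in payload.get("job_details", []):
        if detail.get("state") == "Success":
            outputs += [o["file_name"] for o in detail.get("outputs") or []]
        else:
            # Assumption: any state other than "Success" means the file failed.
            names = [i["file_name"] for i in detail.get("inputs") or []]
            failures.append({"inputs": names, "error": detail.get("error_message")})
    return {
        "finished": payload.get("job_state") in ("Completed", "Failed"),
        "outputs": outputs,
        "failures": failures,
    }


payload = {
    "job_id": "job_12345",
    "job_state": "Completed",
    "job_details": [
        {
            "inputs": [{"file_name": "meeting.mp3", "file_id": "input-0"}],
            "outputs": [{"file_name": "0.json", "file_id": "output-0"}],
            "state": "Success",
            "error_message": None,
        }
    ],
}
print(summarize_callback(payload))
# {'finished': True, 'outputs': ['0.json'], 'failures': []}
```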

### Webhook Server Example

Here's a simple FastAPI server to handle webhook callbacks:

```python
import secrets

from fastapi import FastAPI, Request, HTTPException
import uvicorn

app = FastAPI()
VALID_TOKEN = "your-secret-token"

@app.post("/webhook-endpoint")
async def handle_webhook(request: Request):
    # Validate authentication with a constant-time comparison
    token = request.headers.get("X-SARVAM-JOB-CALLBACK-TOKEN", "")
    if not secrets.compare_digest(token.encode(), VALID_TOKEN.encode()):
        raise HTTPException(status_code=403, detail="Invalid token")

    # Process the webhook data
    data = await request.json()
    job_id = data.get("job_id")
    job_state = data.get("job_state")
    
    if job_state == "Completed":
        print(f"Job {job_id} completed successfully!")
        # Fetch transcripts: download output JSON files (see job_details[].outputs)
    elif job_state == "Failed":
        print(f"Job {job_id} failed: {data.get('error_message')}")
        # Handle failure
    
    return {"status": "success"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Your webhook server must respond with a 200 status code within 30 seconds.
Make sure your webhook URL is publicly accessible and uses HTTPS in production.
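Downloading and parsing results inside the request handler risks exceeding the 30-second response window. One common pattern, sketched here with only the standard library, is to enqueue the job ID, return immediately, and let a background worker do the slow part. All names below (`acknowledge`, `start_worker`, `download_results`) are hypothetical, and `download_results` is a placeholder for your own download logic.

```python
import queue
import threading

pending_jobs = queue.Queue()
handled = []  # record of processed job IDs, kept only for this demo


def acknowledge(payload: dict) -> dict:
    """Do the minimum inside the request handler, then return at once."""
    if payload.get("job_state") == "Completed":
        pending_jobs.put(payload["job_id"])
    return {"status": "success"}


def start_worker(handle_job) -> threading.Thread:
    """Start a daemon thread that processes queued job IDs one at a time."""
    def run() -> None:
        while True:
            job_id = pending_jobs.get()
            try:
                handle_job(job_id)
            finally:
                pending_jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def download_results(job_id: str) -> None:
    """Hypothetical placeholder: fetch and parse the job's output files here."""
    handled.append(job_id)


start_worker(download_results)
acknowledge({"job_id": "job_12345", "job_state": "Completed"})
pending_jobs.join()  # demo only: wait until the worker has drained the queue
```

In a web server, `acknowledge` would be called from the webhook route after the token check, so the response goes out before any download starts.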

## Next Steps

* Select the appropriate API type based on your use case.
* Sign up and get your API key from the [dashboard](https://dashboard.sarvam.ai).
* Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on [Discord](https://discord.com/invite/5rAsykttcs) for guidance.