# Saaras

> Saaras v3 - Domain-aware speech translation model that converts speech directly to English text with enhanced telephony support and intelligent entity preservation.

Saaras v3 is our state-of-the-art speech recognition model with flexible output formats. It supports multiple output modes including transcription, translation, verbatim, transliteration, and code-mixed outputs. Saaras is built to make Indic languages LLM-comprehensible, offering accurate transcriptions and translations across 23 languages (22 Indian languages + English).

**Saaras v3** is the latest version with improved accuracy and performance. It is available in the **Speech-to-Text endpoint** (`/speech-to-text`) and supports multiple output modes via the `mode` parameter.

## At a Glance

|                       |                                                                                                                                                                                                                                                                         |
| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Model ID**          | `saaras:v3`                                                                                                                                                                                                                                                             |
| **What it does**      | Speech-to-text with five output modes: transcribe, translate, verbatim, translit, codemix                                                                                                                                                                               |
| **Languages**         | 23 (22 Indian + English), automatic language detection — [full list](#language-support)                                                                                                                                                                                 |
| **APIs**              | [REST](/api-reference-docs/api-guides-tutorials/speech-to-text/rest-api) (≤30 s), [Batch](/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api) (≤2 hr/file), [WebSocket streaming](/api-reference-docs/api-guides-tutorials/speech-to-text/streaming-api) |
| **Input limits**      | 30 s per REST request; WAV, MP3, AAC, FLAC, OGG and more — [all limits](#limits)                                                                                                                                                                                        |
| **Pricing**           | [Pricing page](/api-reference-docs/pricing)                                                                                                                                                                                                                             |
| **Best for**          | Voice agents, call analytics, 8 kHz telephony audio, code-mixed speech                                                                                                                                                                                                  |
| **Known limitations** | [See below](#known-limitations)                                                                                                                                                                                                                                         |

## Output Modes

Saaras v3 supports multiple output modes via the `mode` parameter. Each mode produces a different output format for the same input audio.

**Example audio:** *"मेरा फोन नंबर है 9840950950"*

| Mode                   | Description                                                                                               | Example Output                                              |
| ---------------------- | --------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| `transcribe` (default) | Standard transcription in the original language with proper formatting and number normalization           | `मेरा फोन नंबर है 9840950950`                               |
| `translate`            | Translates speech from any supported Indic language to English                                            | `My phone number is 9840950950`                             |
| `verbatim`             | Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is | `मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero` |
| `translit`             | Romanization - Transliterates speech to Latin/Roman script                                                | `mera phone number hai 9840950950`                          |
| `codemix`              | Code-mixed text with English words in English and Indic words in native script                            | `मेरा phone number है 9840950950`                           |
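To compare the modes on your own audio, the same file can be sent once per mode. The following is a minimal sketch, not an official helper: `transcribe_in_all_modes` is a hypothetical name, and reading `response.transcript` assumes the SDK response object exposes the `transcript` field of the JSON response as an attribute.

```python
from typing import Any, Dict

# The five mode values listed in the table above.
MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def transcribe_in_all_modes(client: Any, audio_path: str) -> Dict[str, str]:
    """Send one audio file through every mode and collect the transcripts.

    `client` is expected to behave like the SarvamAI client used on this page.
    Accessing `response.transcript` is an assumption (see the note above).
    """
    results: Dict[str, str] = {}
    for mode in MODES:
        # Reopen the file per request so every upload starts at byte 0.
        with open(audio_path, "rb") as audio_file:
            response = client.speech_to_text.transcribe(
                file=audio_file,
                model="saaras:v3",
                mode=mode,
            )
        results[mode] = response.transcript
    return results
```

With a real client, this would be called as `transcribe_in_all_modes(SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY"), "audio.wav")`. Each call is a separate request and counts against your rate limits.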

## Key Features

* Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
* Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
* Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
* Supports 23 languages (22 Indian + English) with optional language identification.
* Provides diarized outputs with precise timestamps for multi-speaker conversations through the Batch API.
* Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

## Language Support

Saaras v3 supports 23 languages (22 Indian languages + English) with comprehensive dialect and accent coverage, including code-mixed audio support and intelligent proper noun preservation for speech-to-English translation.

| Language  | Language Code |   | Language | Language Code |
| --------- | ------------- | - | -------- | ------------- |
| Hindi     | `hi-IN`       |   | Assamese | `as-IN`       |
| Bengali   | `bn-IN`       |   | Urdu     | `ur-IN`       |
| Kannada   | `kn-IN`       |   | Nepali   | `ne-IN`       |
| Malayalam | `ml-IN`       |   | Konkani  | `kok-IN`      |
| Marathi   | `mr-IN`       |   | Kashmiri | `ks-IN`       |
| Odia      | `od-IN`       |   | Sindhi   | `sd-IN`       |
| Punjabi   | `pa-IN`       |   | Sanskrit | `sa-IN`       |
| Tamil     | `ta-IN`       |   | Santali  | `sat-IN`      |
| Telugu    | `te-IN`       |   | Manipuri | `mni-IN`      |
| English   | `en-IN`       |   | Bodo     | `brx-IN`      |
| Gujarati  | `gu-IN`       |   | Maithili | `mai-IN`      |
|           |               |   | Dogri    | `doi-IN`      |

Language codes are optional. When not specified or set to `unknown`, the model will automatically detect the input language and return a `language_probability` score indicating detection confidence.
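When relying on automatic detection, it can help to check the confidence score before trusting the detected language. The sketch below works on the parsed JSON response as a plain `dict`. The function name `resolve_language`, the `0.8` threshold, and the fallback code are illustrative assumptions, not values recommended by the API.

```python
def resolve_language(
    response: dict,
    fallback: str = "hi-IN",
    min_probability: float = 0.8,
) -> str:
    """Pick a language code from a Speech-to-Text response.

    Returns `fallback` when nothing was detected or the detection
    confidence is below `min_probability` (an arbitrary example threshold).
    """
    code = response.get("language_code")
    probability = response.get("language_probability")

    if code is None:
        # No language was detected at all.
        return fallback
    if probability is None:
        # A specific language_code was sent with the request, so detection
        # was skipped and there is no confidence score to check.
        return code
    return code if probability >= min_probability else fallback
```

For example, a response with `"language_code": "hi-IN"` and `"language_probability": 0.95` resolves to `hi-IN`, while the same code at probability `0.4` resolves to the fallback.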

**Additional Capabilities:**

* Includes dialects and accents of the above languages
* Code-mixed audio support
* Intelligent Proper Noun and Entity Preservation to ensure proper nouns, regional names, and entities are recognized and retained accurately during transcription

## API Response Format

The Speech-to-Text API returns a JSON response with the following fields:

| Field                  | Type             | Description                                                                                                                                                                                                                                                                                                                                 |
| ---------------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `request_id`           | `string`         | Unique identifier for the API request.                                                                                                                                                                                                                                                                                                      |
| `transcript`           | `string`         | The transcribed text from the provided audio file.                                                                                                                                                                                                                                                                                          |
| `timestamps`           | `object or null` | Contains word-level timestamps (`start_time_seconds`, `end_time_seconds`, `words`). Only included when `with_timestamps` is set to `true`.                                                                                                                                                                                                  |
| `diarized_transcript`  | `object or null` | Diarized transcript with speaker labels. Available through batch API.                                                                                                                                                                                                                                                                       |
| `language_code`        | `string or null` | BCP-47 code of the detected language (e.g., `hi-IN`). Returns the most predominant language if multiple are detected. Returns `null` if no language is detected.                                                                                                                                                                            |
| `language_probability` | `number or null` | Float value (0.0 to 1.0) indicating the probability of the detected language being correct. Higher values indicate higher confidence. Returns a value when `language_code` is not provided or set to `unknown`. Returns `null` when a specific `language_code` is provided (language detection is skipped). Always present in the response. |

**Example Response:**

```json
{
  "request_id": "20260209_abc123-def4-5678-ghij-klmnopqrstuv",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "timestamps": null,
  "diarized_transcript": null,
  "language_code": "hi-IN",
  "language_probability": 0.95
}
```
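When `with_timestamps` is enabled, the `timestamps` object can be turned into per-word records. The sketch below assumes that `words`, `start_time_seconds`, and `end_time_seconds` are parallel lists of equal length. That layout is an assumption drawn from the field names in the table above, so verify it against a real response. `WordTiming` and `word_timings` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class WordTiming(NamedTuple):
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def word_timings(timestamps: Optional[dict]) -> List[WordTiming]:
    """Pair each word with its start and end time.

    Returns an empty list when `timestamps` is null, which is the case
    whenever `with_timestamps` was not set to true.
    """
    if not timestamps:
        return []

    words = timestamps["words"]
    starts = timestamps["start_time_seconds"]
    ends = timestamps["end_time_seconds"]
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("timestamp lists have different lengths")

    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]
```

Passing the `timestamps` value from the example response above (`null`, parsed as `None`) returns an empty list.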

## Key Capabilities

**Transcribe (`mode="transcribe"`):** Standard transcription in the original language with proper formatting and number normalization. This is the default mode.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

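# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saaras:v3",
        mode="transcribe"  # default mode
    )

print(response)
# The response's `transcript` field contains: मेरा फोन नंबर है 9840950950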
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav"; // or .mp3

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  const response = await client.speechToText.transcribe({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v3",
    mode: "transcribe"  // default mode
  });

  console.log(response);
}

main();
```

```bash
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v3" \
  -F mode="transcribe"
```

**Translate (`mode="translate"`):** Translates speech from any supported Indic language directly to English. Well suited to making Indic content LLM-comprehensible.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

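# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saaras:v3",
        mode="translate"
    )

print(response)
# Input audio: "मेरा फोन नंबर है 9840950950"
# The response's `transcript` field contains: My phone number is 9840950950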
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav";

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  const response = await client.speechToText.transcribe({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v3",
    mode: "translate"
  });

  console.log(response);
}

main();
```

```bash
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v3" \
  -F mode="translate"
```

**Verbatim (`mode="verbatim"`):** Exact word-for-word transcription without normalization, preserving filler words and spoken numbers as-is. Ideal for detailed analysis.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

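# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saaras:v3",
        mode="verbatim"
    )

print(response)
# Input audio: "मेरा फोन नंबर है 9840950950"
# The response's `transcript` field contains: मेरा फोन नंबर है नौ आठ चार zero नौ पांच zero नौ पांच zero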
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav";

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  const response = await client.speechToText.transcribe({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v3",
    mode: "verbatim"
  });

  console.log(response);
}

main();
```

```bash
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v3" \
  -F mode="verbatim"
```

**Translit (`mode="translit"`):** Romanizes speech by transliterating it into Latin (Roman) script. Useful for search indexing or when working with systems that don't support Indic scripts.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

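# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saaras:v3",
        mode="translit"
    )

print(response)
# Input audio: "मेरा फोन नंबर है 9840950950"
# The response's `transcript` field contains: mera phone number hai 9840950950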
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav";

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  const response = await client.speechToText.transcribe({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v3",
    mode: "translit"
  });

  console.log(response);
}

main();
```

```bash
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v3" \
  -F mode="translit"
```

**Codemix (`mode="codemix"`):** Code-mixed text with English words in English and Indic words in native script. Well suited to India's naturally multilingual conversations.

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

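# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.transcribe(
        file=audio_file,
        model="saaras:v3",
        mode="codemix"
    )

print(response)
# Input audio: "मेरा फोन नंबर है 9840950950"
# The response's `transcript` field contains: मेरा phone number है 9840950950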
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav";

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  const response = await client.speechToText.transcribe({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v3",
    mode: "codemix"
  });

  console.log(response);
}

main();
```

```bash
curl -X POST https://api.sarvam.ai/speech-to-text \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v3" \
  -F mode="codemix"
```

## Limits

| Limit                                             | Value                                                                                                           |
| ------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| Max audio duration (real-time REST)               | 30 seconds                                                                                                      |
| Supported formats                                 | WAV, MP3, AAC, AIFF, OGG, OPUS, FLAC, MP4, AMR, WMA, WebM (auto-detected)                                       |
| Raw PCM input (`pcm_s16le`, `pcm_l16`, `pcm_raw`) | Requires `input_audio_codec`; must be 16 kHz                                                                    |
| Longer audio                                      | Use the [Batch API](/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api) (up to 2 hours per file) |
| Rate limits                                       | See [Rate Limits](/api-reference-docs/ratelimits)                                                               |
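The duration limits above can be checked client-side before uploading. The following is a minimal sketch for WAV files only, using Python's standard `wave` module. The function names and the `"rest"` / `"batch"` labels are illustrative, and other formats would need a different way to measure duration.

```python
import wave

REST_MAX_SECONDS = 30.0            # real-time REST limit from the table above
BATCH_MAX_SECONDS = 2 * 60 * 60.0  # Batch API limit: 2 hours per file


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def pick_api(duration_seconds: float) -> str:
    """Choose which API fits a recording of the given duration."""
    if duration_seconds <= REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError(
        "audio is longer than 2 hours; split it before using the Batch API"
    )
```

For instance, `pick_api(wav_duration_seconds("audio.wav"))` returns `"rest"` for a 12-second clip and `"batch"` for a 5-minute call recording.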

## Known Limitations

| Limitation                              | Detail                                                                                         | Workaround                                                                                                                            |
| --------------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| **30-second cap on real-time REST**     | The real-time `/speech-to-text` endpoint only accepts audio up to 30 seconds long              | Use the [Batch API](/api-reference-docs/api-guides-tutorials/speech-to-text/batch-api) for longer recordings (up to 2 hours per file) |
| **`mode` parameter is v3-only**         | The `mode` parameter (transcribe / translate / verbatim / translit / codemix) works only with `saaras:v3` | Use `saaras:v3`; older versions ignore `mode`                                                                             |
| **Translate mode outputs English only** | `mode="translate"` always produces English text, regardless of input language                  | Use transcribe mode if you need output in the original language                                                                       |
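Because older model versions ignore `mode` rather than rejecting it, a mismatch can go unnoticed. One way to guard against this is to validate the combination before sending the request. The helper below is a hypothetical sketch, not part of the SDK; it only builds the `model` and `mode` form fields.

```python
from typing import Dict, Optional

V3_MODEL = "saaras:v3"
VALID_MODES = ("transcribe", "translate", "verbatim", "translit", "codemix")


def build_stt_params(model: str = V3_MODEL, mode: Optional[str] = None) -> Dict[str, str]:
    """Build request fields, failing fast on combinations the API would ignore."""
    params = {"model": model}
    if mode is None:
        # No mode requested; saaras:v3 then uses its default, transcribe.
        return params
    if model != V3_MODEL:
        raise ValueError(f"mode is ignored by {model}; use {V3_MODEL} instead")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    params["mode"] = mode
    return params
```

`build_stt_params(mode="translate")` returns `{"model": "saaras:v3", "mode": "translate"}`, while `build_stt_params(model="saaras:v2.5", mode="translate")` raises a `ValueError` instead of silently dropping the mode.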

## Next Steps

* Learn how to integrate Saaras v3 into your application.
* Complete API documentation for the Speech-to-Text endpoint.
* Step-by-step tutorial for speech-to-text transcription.

***

**Deprecation Notice:** Saaras v2.5 will be deprecated soon. We recommend migrating to **Saaras v3** for improved accuracy and performance. The v2.5 model will continue to work during the transition period, but new features and improvements will only be available in v3.

### About Saaras v2.5

Saaras v2.5 is the previous speech translation model available in the **Speech-to-Text Translate endpoint** (`/speech-to-text-translate`). It converts speech directly to English text with enhanced telephony support and intelligent entity preservation.

**Key Difference:** Saaras v2.5 uses the `/speech-to-text-translate` endpoint, while Saaras v3 uses the `/speech-to-text` endpoint with mode parameter support.

### Key Features (v2.5)

* Advanced prompting system for domain-specific translation and hotword retention, ensuring accurate context preservation.
* Optimized for 8 kHz telephony audio with enhanced multi-speaker recognition capabilities.
* Preserves proper nouns and entities accurately across languages, maintaining context and meaning.
* Supports 11 Indian languages with optional language identification.
* Provides diarized outputs with precise timestamps for multi-speaker conversations through the Batch API.
* Converts speech directly to English text, eliminating the need for separate transcription and translation steps.

### Translation Quality (v2.5 Benchmarks)

COMET is a learned metric for evaluating machine translation that assesses semantic accuracy, fluency, and contextual relevance. Saaras v2.5 scores 89.3% overall on the VISTAAR + IndicVoices benchmark, a dataset curated from diverse Indian language audio sources, including code-mixed content, noisy environments, and regional accents.

**COMET Score Performance:**

* **Across 11 Languages:** 89.3%
* **English:** 94.62%
* **Hindi:** 91.83%
* **9 Other languages:** 88.41%

*Higher is better; compared on the VISTAAR + IndicVoices benchmark.*
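The overall figure is consistent with a plain per-language average of the three reported groups. The check below assumes that the headline score is an unweighted mean over the 11 languages and that 88.41% is the mean of the nine remaining languages. The page does not state how the aggregate was computed, so treat this as a sanity check rather than the official method.

```python
# (score in percent, number of languages in the group)
groups = {
    "English": (94.62, 1),
    "Hindi": (91.83, 1),
    "9 other languages": (88.41, 9),
}


def overall_score(score_groups: dict) -> float:
    """Average the group scores, weighting each group by its language count."""
    total = sum(score * count for score, count in score_groups.values())
    languages = sum(count for _, count in score_groups.values())
    return total / languages


overall = overall_score(groups)
print(round(overall, 1))  # 89.3
```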

Why COMET? It evaluates not only lexical accuracy but also how well the translation captures meaning and context, critical for Indic languages with complex structures.

**Dataset Description:** Contains real-world, multi-accented speech samples that cover 10 major Indic languages, ensuring representation of India's linguistic diversity. Includes code-mixed phrases, domain-specific vocabulary, and colloquial expressions.

### v2.5 Usage Example

```python
from sarvamai import SarvamAI

client = SarvamAI(
    api_subscription_key="YOUR_SARVAM_API_KEY"
)

# Using the deprecated v2.5 model; migrate to saaras:v3 with mode="translate"
with open("audio.wav", "rb") as audio_file:
    response = client.speech_to_text.translate(
        file=audio_file,
        model="saaras:v2.5"
    )

print(response)
```

```javascript
import { SarvamAIClient } from "sarvamai";
import fs from "fs";

const API_KEY = "YOUR_SARVAM_API_KEY";
const FILE_PATH = "path/to/audio.wav";

async function main() {
  const client = new SarvamAIClient({ apiSubscriptionKey: API_KEY });

  // Using deprecated v2.5 model
  const response = await client.speechToText.translate({
    file: fs.createReadStream(FILE_PATH),
    model: "saaras:v2.5"  // Deprecated - migrate to saaras:v3 with mode="translate"
  });

  console.log(response);
}

main();
```

```bash
# Using deprecated v2.5 model
curl -X POST https://api.sarvam.ai/speech-to-text-translate \
  -H "api-subscription-key: <YOUR_SARVAM_API_KEY>" \
  -H "Content-Type: multipart/form-data" \
  -F file=@"audio.wav" \
  -F model="saaras:v2.5"
```

### Migration Guide

To migrate from Saaras v2.5 to v3:

1. **Change the endpoint:** Switch from `/speech-to-text-translate` to `/speech-to-text`
2. **Update the model parameter:** Change from `saaras:v2.5` to `saaras:v3`
3. **Add the mode parameter:** Use `mode="translate"` to get English output (similar to v2.5 behavior)

```diff
# Endpoint change
- POST /speech-to-text-translate
+ POST /speech-to-text

# Parameter changes
- model="saaras:v2.5"
+ model="saaras:v3"
+ mode="translate"
```

**SDK Migration:**

```diff
# Python
- response = client.speech_to_text.translate(
-     file=open("audio.wav", "rb"), model="saaras:v2.5"
- )
+ response = client.speech_to_text.transcribe(
+     file=open("audio.wav", "rb"), model="saaras:v3", mode="translate"
+ )

# JavaScript
- const response = await client.speechToText.translate({ file: fs.createReadStream("audio.wav"), model: "saaras:v2.5" });
+ const response = await client.speechToText.transcribe({ file: fs.createReadStream("audio.wav"), model: "saaras:v3", mode: "translate" });
```

The response format remains compatible.
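If requests are built in one place, the three migration steps can be applied mechanically. The sketch below rewrites an endpoint path and form fields from the v2.5 shape to the v3 shape. `migrate_v25_request` is a hypothetical helper, and it assumes requests are represented as an endpoint string plus a `dict` of fields.

```python
from typing import Dict, Tuple

OLD_ENDPOINT = "/speech-to-text-translate"
NEW_ENDPOINT = "/speech-to-text"


def migrate_v25_request(endpoint: str, params: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Rewrite a Saaras v2.5 translate request into its v3 equivalent.

    Requests that are not v2.5 translate requests are returned unchanged.
    """
    if endpoint != OLD_ENDPOINT or params.get("model") != "saaras:v2.5":
        return endpoint, dict(params)

    migrated = dict(params)            # copy so the caller's dict is untouched
    migrated["model"] = "saaras:v3"    # step 2: update the model
    migrated["mode"] = "translate"     # step 3: keep English output
    return NEW_ENDPOINT, migrated      # step 1: switch the endpoint
```

Calling it with `("/speech-to-text-translate", {"model": "saaras:v2.5"})` returns `("/speech-to-text", {"model": "saaras:v3", "mode": "translate"})`.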