Frequently Asked Questions

Find answers to common questions about our speech-to-text services

General Questions

What audio formats are supported?

The REST and Batch APIs support a wide range of audio formats, including:

WAV
MP3
M4A
AAC
OGG
FLAC
WebM
PCM (pcm_s16le, pcm_l16, pcm_raw)

The WebSocket (Streaming) API supports only:

WAV
Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

For optimal results, we recommend:

Sample rate: 16kHz or higher
Bit depth: 16-bit
Channels: Mono or Stereo
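
To check a WAV file against these recommendations before uploading, something like the following sketch may help. It uses only Python's standard `wave` module; the function name `check_wav` is illustrative and not part of the Sarvam SDK, and it only works for WAV input.

```python
import wave


def check_wav(path):
    """Return a list of warnings for a WAV file, based on the recommendations above."""
    warnings = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
    if sample_rate < 16000:
        warnings.append(f"sample rate {sample_rate} Hz is below 16 kHz")
    if bit_depth != 16:
        warnings.append(f"bit depth is {bit_depth}-bit, not 16-bit")
    if channels not in (1, 2):
        warnings.append(f"{channels} channels; expected mono or stereo")
    return warnings
```

An empty list means the file matches the recommended settings; a non-empty list is a hint, not a guarantee that the API will reject the file.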

What languages are supported?

Our models support multiple Indian and global languages:

Indian Languages

Hindi
English (Indian)
Bengali
Tamil
Telugu
Kannada
Malayalam
Marathi
Gujarati
Punjabi

Global Languages

English (US, UK, AU)
French
German
Spanish
Japanese

Check our models page for the complete list and specific model capabilities.

What is the maximum duration?

The limits vary by API endpoint:

REST API

Maximum duration: 30 seconds per request

Batch API

Maximum duration: 2 hours per file
Maximum files per job: 20

WebSocket API (Streaming)

Continuous streaming with chunked audio — no duration limit
Concurrency limits apply per plan (see Rate Limits)

For audio longer than 30 seconds, use the Batch API. For files longer than 2 hours, we recommend:

Splitting into smaller segments
Contacting support for custom solutions
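
As a rough sketch of the routing described above, the helper below picks an endpoint from the audio duration and, for files over 2 hours, computes near-equal segment boundaries. The function names and the "rest" / "batch" / "batch-split" labels are hypothetical; only the 30-second and 2-hour limits come from this page. Cutting the audio at those boundaries is left to your audio tooling.

```python
import math

REST_MAX_SECONDS = 30             # REST API limit stated above
BATCH_MAX_SECONDS = 2 * 60 * 60   # Batch API per-file limit stated above


def choose_endpoint(duration_seconds):
    """Pick an endpoint label for a clip of the given duration."""
    if duration_seconds <= REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    return "batch-split"  # too long for one Batch file; split first


def segment_bounds(duration_seconds, max_seconds=BATCH_MAX_SECONDS):
    """Return (start, end) pairs of near-equal length, each at most max_seconds."""
    count = max(1, math.ceil(duration_seconds / max_seconds))
    length = duration_seconds / count
    return [
        (i * length, min((i + 1) * length, duration_seconds))
        for i in range(count)
    ]
```

For example, a 2.5-hour recording (9000 seconds) yields two 75-minute segments. Splitting at equal lengths can cut a word in half, so in practice you may prefer to split at silences near these boundaries.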

How accurate is the transcription?

Accuracy varies based on several factors:

Typical Accuracy Rates

Clear speech, minimal background noise: 95-98%
Multiple speakers, moderate noise: 90-95%
Heavy accent or background noise: 85-90%

Factors affecting accuracy:

Audio quality
Background noise
Speaker accent
Speaking speed
Domain-specific terminology

Use our interactive API reference to test with your specific audio.

Technical Questions

How does speaker diarization work?

Speaker diarization identifies and labels different speakers in the audio:

Process:
- Voice activity detection
- Speaker segmentation
- Speaker clustering
- Speaker labeling

Usage (via Batch API):

from sarvamai import SarvamAI

client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

# Speaker diarization is available through the Batch API
# See: https://docs.sarvam.ai/api-reference-docs/speech-to-text/batch
job = client.speech_to_text_job.create_job(
    model="saaras:v3",
    mode="transcribe",
    with_diarization=True,
)
job.upload_files(file_paths=["audio.mp3"])
job.start()
job.wait_until_complete()
job.download_outputs(output_dir="./output")

Output:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "text": "Hello, how are you?",
      "start": 0.0,
      "end": 1.5
    },
    {
      "speaker": "Speaker 2",
      "text": "I'm doing well, thanks!",
      "start": 1.8,
      "end": 3.2
    }
  ]
}
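
Once you have diarized output, a common next step is to total the speaking time per speaker. The sketch below assumes the output has exactly the shape shown in the sample above (a top-level "segments" list with "speaker", "start", and "end" fields); the actual files written by the Batch API may carry additional fields or a different layout, so adjust the keys as needed.

```python
import json
from collections import defaultdict


def talk_time_by_speaker(diarized):
    """Sum (end - start) per speaker label, in seconds."""
    totals = defaultdict(float)
    for segment in diarized["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Sample data copied from the output shown above
sample = json.loads("""
{
  "segments": [
    {"speaker": "Speaker 1", "text": "Hello, how are you?", "start": 0.0, "end": 1.5},
    {"speaker": "Speaker 2", "text": "I'm doing well, thanks!", "start": 1.8, "end": 3.2}
  ]
}
""")

for speaker, seconds in talk_time_by_speaker(sample).items():
    print(f"{speaker}: {seconds:.1f}s")
```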

What are the rate limits?

Rate limits are applied per account based on your subscription plan:

Plan	Rate Limit
Starter	60 requests/min
Pro	200 requests/min
Business	1,000 requests/min
Enterprise	Custom limits

Duration Limits

REST API: Max 30 seconds of audio per request
Batch API: Up to 2 hours per file, 20 files per job
Streaming API: Continuous (chunked) streaming; concurrency limits per plan

For batch endpoints, implement a minimum 5ms delay between status polling requests.

View the full Credits & Rate Limits page for details on HTTP headers, error handling, and upgrade paths.

How do I handle errors?

Common errors and solutions:

1. Authentication Errors (403)

{
  "error": {
    "code": "invalid_api_key_error",
    "message": "API key is invalid or expired"
  }
}

Solution: Check that the API key is valid and is sent in the api-subscription-key header. Note: Sarvam returns HTTP 403 (not 401) for invalid or missing API keys; see the Authentication page.

2. Rate Limit / Quota Errors (429)

{
  "error": {
    "code": "insufficient_quota_error",
    "message": "API quota exceeded"
  }
}

Solution: A 429 with rate_limit_exceeded_error means too many requests; retry with exponential backoff. A 429 with insufficient_quota_error means credits are exhausted, so retrying will not help until you upgrade your plan. See Errors & Troubleshooting.
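
A minimal sketch of exponential backoff with jitter is shown below. `RateLimitError` is a hypothetical exception standing in for however your HTTP client or SDK surfaces a 429 with rate_limit_exceeded_error; the retry counts and delays are illustrative defaults, not values recommended by Sarvam.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 rate_limit_exceeded_error."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request(); on RateLimitError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from clients that were throttled together
            sleep(delay + random.uniform(0, delay / 2))
```

Wrap only the call that can be throttled, for example `call_with_backoff(lambda: send_request(audio))`, where `send_request` is your own function that raises `RateLimitError` on a 429.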

3. Invalid Input (400)

{
  "error": {
    "code": "invalid_request_error",
    "message": "Unsupported audio format"
  }
}

Solution: Confirm the file is in a format supported by the endpoint you are calling (see “What audio formats are supported?” above).

4. Failed to read the file (400)

{
  "error": {
    "message": "Failed to read the file, please check the audio format.",
    "code": "invalid_request_error"
  }
}

This almost always means the uploaded bytes are not a readable audio file — not that the format is unsupported. Common causes:

Empty or zero-length file — the upload contains no bytes, or a buffer of all zeros
Empty WebM blob from a browser recorder — MediaRecorder produced a header with no audio frames (see “How do I record audio in the browser?” below)
Junk or placeholder bytes — the payload isn’t a real audio container
Truncated or incomplete container — the file was cut off during recording, download, or copy
Passing a filename string instead of a file object — use file=open("audio.wav", "rb") in Python, not file="audio.wav"

Solution: before uploading, verify the file exists, its size is greater than 0, and you’re passing a file handle/stream (not a path string).
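
These checks can be wrapped in a small helper, sketched here with the standard library only. The name `open_audio_for_upload` is illustrative and not part of the Sarvam SDK. It catches missing and zero-length files, but it cannot detect truncated containers or buffers of all zeros.

```python
import os


def open_audio_for_upload(path):
    """Return a binary file handle for path, or raise if the file is missing or empty."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such audio file: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"Audio file is empty: {path}")
    return open(path, "rb")
```

Use it as `with open_audio_for_upload("audio.wav") as audio:` and pass `audio` as the `file` argument, so the request receives a file object rather than a path string.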

See our error handling guide for more details.

How do I record audio in the browser and transcribe it?

All the Node.js examples in these docs read audio with fs.createReadStream(...), which doesn’t exist in the browser. To transcribe microphone audio from a web page, record with MediaRecorder and upload the resulting blob.

The most common mistake is uploading an empty WebM blob (a container header with no audio frames), which the API rejects with "Failed to read the file, please check the audio format." The recipe below avoids that by stopping the recorder cleanly, waiting for the final dataavailable event, and checking blob.size > 0 before uploading:

async function recordAndTranscribe(durationMs = 5000) {
  // 1. Capture the microphone
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // 2. Pick a supported mimeType (WebM/Opus is widely supported and accepted by the API)
  const mimeType = MediaRecorder.isTypeSupported("audio/webm;codecs=opus")
    ? "audio/webm;codecs=opus"
    : "audio/webm";

  const recorder = new MediaRecorder(stream, { mimeType });
  const chunks = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };

  // 3. Stop cleanly: the final dataavailable fires before "stop" resolves
  const stopped = new Promise((resolve) => (recorder.onstop = resolve));
  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;
  stream.getTracks().forEach((track) => track.stop());

  // 4. Never upload an empty recording
  const blob = new Blob(chunks, { type: mimeType });
  if (blob.size === 0) {
    throw new Error("Recording is empty: no audio frames were captured.");
  }

  // 5. Upload to the Speech-to-Text API
  const formData = new FormData();
  formData.append("file", blob, "recording.webm");
  formData.append("model", "saaras:v3");
  formData.append("mode", "transcribe");

  // SARVAM_API_KEY must be defined by the caller (see the note below about client-side keys)
  const response = await fetch("https://api.sarvam.ai/speech-to-text", {
    method: "POST",
    headers: { "api-subscription-key": SARVAM_API_KEY },
    body: formData,
  });
  // Surface HTTP errors instead of returning the error body as if it were a transcript
  if (!response.ok) {
    throw new Error(`Transcription failed with HTTP ${response.status}`);
  }
  return await response.json();
}

Pre-flight checklist before any upload:

The recording/file exists and size > 0
You’re sending the blob/file object, not a path or filename string
Audio longer than 30 seconds goes to the Batch API instead of the sync REST endpoint

Don’t ship your API key in client-side code. In production, upload the recording to your own backend and call the Sarvam API from there.

How do I optimize for real-time transcription?

Tips for optimal real-time performance:

Audio Settings

const config = {
  sampleRate: 16000,
  encoding: 'LINEAR16',
  channels: 1
}

Chunk Size

Optimal: 100ms - 500ms chunks
Balance between latency and accuracy
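
To translate a chunk duration into a byte count for raw PCM, multiply sample rate, bytes per sample, channels, and duration. The sketch below assumes the 16 kHz, 16-bit, mono settings shown above; the function names are illustrative.

```python
def chunk_size_bytes(chunk_ms, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM needed to hold chunk_ms milliseconds of audio."""
    return sample_rate * bytes_per_sample * channels * chunk_ms // 1000


def iter_chunks(pcm, chunk_ms=200, **audio_format):
    """Yield consecutive slices of a PCM byte string; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms, **audio_format)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

With these settings, a 100 ms chunk is 3,200 bytes and a 500 ms chunk is 16,000 bytes. Smaller chunks reduce latency but mean more messages per second of audio.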

WebSocket Connection

const ws = new WebSocket('wss://api.sarvam.ai/v1/stt/stream')
ws.binaryType = 'arraybuffer'

Error Handling

ws.onerror = (error) => {
  console.error('WebSocket Error:', error)
  // Implement reconnection logic
}

View our real-time streaming guide for detailed examples.

Billing & Support

How is usage calculated?

Usage is calculated based on:

Audio Duration

Rounded up to the nearest second
Minimum charge: 1 second

Features Used

Base transcription
Speaker diarization (+20%)
Language detection (+10%)
Word timestamps (+10%)

Model Type

Saarika: Base rate
Saaras: Premium rate

Example calculation:

5 minutes audio × Base rate
+ Speaker diarization (20%)
+ Word timestamps (10%)
= Total cost
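
The calculation above can be sketched as a small function. Two assumptions are made here that this page does not confirm: the feature surcharges are added together as percentages of the base cost (rather than compounded), and the rate is expressed per second. The rate value in the usage note is a placeholder, not a real price.

```python
import math

# Surcharges listed under "Features Used" above, as fractions of the base cost
SURCHARGES = {
    "diarization": 0.20,
    "language_detection": 0.10,
    "word_timestamps": 0.10,
}


def estimate_cost(duration_seconds, rate_per_second, features=()):
    """Estimate cost: duration rounded up to whole seconds (minimum 1), plus surcharges."""
    billable_seconds = max(1, math.ceil(duration_seconds))
    multiplier = 1 + sum(SURCHARGES[name] for name in features)
    return billable_seconds * rate_per_second * multiplier
```

For the example above, `estimate_cost(300, rate, ["diarization", "word_timestamps"])` bills 300 seconds at 1.3 times the base rate. Check the pricing page for actual rates and for how surcharges combine.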

How do I get support?

Multiple support channels are available:

Documentation

Community

Discord Community

Direct Support

Email: developer@sarvam.ai
Enterprise: Dedicated support manager

Still Have Questions?

Can’t find what you’re looking for?

Our team is here to help! Reach out through any of our support channels.
