FAQs

Frequently Asked Questions

Find answers to common questions about our speech-to-text services

General Questions

REST and Batch APIs support a wide range of audio formats including:

  • WAV
  • MP3
  • M4A
  • AAC
  • OGG
  • FLAC
  • WebM
  • PCM (pcm_s16le, pcm_l16, pcm_raw)

WebSocket/Streaming APIs only support:

  • WAV
  • Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

For optimal results, we recommend:

  • Sample rate: 16kHz or higher
  • Bit depth: 16-bit
  • Channels: Mono or Stereo
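
The recommendations above can be checked locally before a file is sent. The following is a minimal sketch for WAV files only, using Python's standard `wave` module; the function name `check_wav` is an illustration and not part of any Sarvam SDK.

```python
import wave


def check_wav(path: str) -> list[str]:
    """Return warnings for a WAV file; an empty list means it matches the recommendations."""
    warnings = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() < 16000:
            warnings.append(f"sample rate {wav.getframerate()} Hz is below 16 kHz")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            warnings.append(f"bit depth is {wav.getsampwidth() * 8}-bit, not 16-bit")
        if wav.getnchannels() not in (1, 2):
            warnings.append(f"{wav.getnchannels()} channels; expected mono or stereo")
    return warnings
```

An empty return value means the file meets the recommended sample rate, bit depth, and channel count; compressed formats such as MP3 would need a different library to inspect.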

Our models support multiple Indian and global languages:

Indian Languages

  • Hindi
  • English (Indian)
  • Bengali
  • Tamil
  • Telugu
  • Kannada
  • Malayalam
  • Marathi
  • Gujarati
  • Punjabi

Global Languages

  • English (US, UK, AU)
  • French
  • German
  • Spanish
  • Japanese

Check our models page for the complete list and specific model capabilities.

The limits vary by API endpoint:

REST API

  • Maximum duration: 30 seconds per request

Batch API

  • Maximum duration: 2 hours per file
  • Maximum files per job: 20

WebSocket API (Streaming)

  • Continuous streaming with chunked audio — no duration limit
  • Concurrency limits apply per plan (see Rate Limits)

For audio longer than 30 seconds, use the Batch API. For files longer than 2 hours, we recommend:

  1. Splitting into smaller segments
  2. Contacting support for custom solutions
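
As a rough illustration of the routing described above, the sketch below picks an endpoint from the audio duration and computes fixed-length segment boundaries for files over the Batch limit. The function names and the returned labels (`"rest"`, `"batch"`, `"split"`) are hypothetical; only the 30-second and 2-hour limits come from this page.

```python
REST_MAX_SECONDS = 30            # REST API limit stated above
BATCH_MAX_SECONDS = 2 * 60 * 60  # Batch API per-file limit stated above


def choose_api(duration_s: float) -> str:
    """Return which path to use for audio of the given duration."""
    if duration_s <= REST_MAX_SECONDS:
        return "rest"
    if duration_s <= BATCH_MAX_SECONDS:
        return "batch"
    return "split"  # too long for one Batch file: split first


def segment_bounds(duration_s: float, max_len_s: float = BATCH_MAX_SECONDS) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the audio in pieces of at most max_len_s."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds
```

Cutting at fixed offsets can split a word in half, so in practice you may prefer to cut at a nearby pause.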

Accuracy varies based on several factors:

Typical Accuracy Rates

  • Clear speech, minimal background noise: 95-98%
  • Multiple speakers, moderate noise: 90-95%
  • Heavy accent or background noise: 85-90%

Factors affecting accuracy:

  • Audio quality
  • Background noise
  • Speaker accent
  • Speaking speed
  • Domain-specific terminology

Use our interactive API reference to test with your specific audio.
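
To measure accuracy on your own audio, one common approach is to compare the returned transcript with a hand-made reference transcript using word error rate (WER). This page does not say how the percentages above are measured, so treating accuracy as roughly 1 minus WER is an assumption; the sketch below is a plain word-level edit distance with no text normalization beyond lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic programming over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

For example, one wrong word in a four-word reference gives a WER of 0.25.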

Technical Questions

Speaker diarization identifies and labels different speakers in the audio:

  1. Process:

    • Voice activity detection
    • Speaker segmentation
    • Speaker clustering
    • Speaker labeling
  2. Usage (via Batch API):


    from sarvamai import SarvamAI

    client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Speaker diarization is available through the Batch API
    # See: https://docs.sarvam.ai/api-reference-docs/speech-to-text/batch
    job = client.speech_to_text_job.create_job(
        model="saaras:v3",
        mode="transcribe",
        with_diarization=True,
    )
    job.upload_files(file_paths=["audio.mp3"])
    job.start()
    job.wait_until_complete()
    job.download_outputs(output_dir="./output")

  3. Output:

    {
      "segments": [
        {
          "speaker": "Speaker 1",
          "text": "Hello, how are you?",
          "start": 0.0,
          "end": 1.5
        },
        {
          "speaker": "Speaker 2",
          "text": "I'm doing well, thanks!",
          "start": 1.8,
          "end": 3.2
        }
      ]
    }
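
Once the output is available, the segments can be post-processed like any other JSON. The sketch below totals speaking time per speaker; it assumes the result has exactly the `segments` structure shown above, and uses an inline sample rather than a downloaded file because the file names written to the output directory are not described here.

```python
import json
from collections import defaultdict


def talk_time_by_speaker(result: dict) -> dict[str, float]:
    """Sum (end - start) in seconds for each speaker label."""
    totals = defaultdict(float)
    for segment in result["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Inline copy of the sample output shown above.
sample = json.loads("""
{
  "segments": [
    {"speaker": "Speaker 1", "text": "Hello, how are you?", "start": 0.0, "end": 1.5},
    {"speaker": "Speaker 2", "text": "I'm doing well, thanks!", "start": 1.8, "end": 3.2}
  ]
}
""")

for speaker, seconds in talk_time_by_speaker(sample).items():
    print(f"{speaker}: {seconds:.1f}s")
```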

Rate limits are applied per account based on your subscription plan:

Plan          Rate Limit
Starter       60 requests/min
Pro           200 requests/min
Business      1,000 requests/min
Enterprise    Custom limits

Duration Limits

  • REST API: Max 30 seconds of audio per request
  • Batch API: Up to 2 hours per file, 20 files per job
  • Streaming API: Continuous (chunked) streaming; concurrency limits per plan

For batch endpoints, implement a minimum 5ms delay between status polling requests.

View the full Credits & Rate Limits page for details on HTTP headers, error handling, and upgrade paths.

Common errors and solutions:

1. Authentication Errors (403)

    {
      "error": {
        "code": "invalid_api_key_error",
        "message": "API key is invalid or expired"
      }
    }

Solution: Check API key validity and proper configuration. Note: Sarvam returns HTTP 403 (not 401) for invalid/missing API keys — see the Authentication page.

2. Rate Limit / Quota Errors (429)

    {
      "error": {
        "code": "insufficient_quota_error",
        "message": "API quota exceeded"
      }
    }

Solution: Implement exponential backoff or upgrade plan. A 429 with rate_limit_exceeded_error means too many requests; insufficient_quota_error means credits are exhausted — see Errors & Troubleshooting.
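
A minimal sketch of exponential backoff for the `rate_limit_exceeded_error` case is shown below. `RateLimitError` is a hypothetical stand-in for however your HTTP client or SDK surfaces a 429, and `request` is any zero-argument callable that performs the call; neither name comes from the Sarvam SDK. Retrying does not help with `insufficient_quota_error`, since that indicates exhausted credits.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception: raise this when a response has status 429 (rate limit)."""


def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call request(); on RateLimitError wait 1s, 2s, 4s, ... (capped) and try again."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: let the caller handle it
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.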

3. Invalid Input (400)

    {
      "error": {
        "code": "invalid_request_error",
        "message": "Unsupported audio format"
      }
    }

Solution: Confirm that the audio is in a format supported by the endpoint you are calling and that it meets that endpoint's duration limit.

4. Failed to read the file (400)

    {
      "error": {
        "message": "Failed to read the file, please check the audio format.",
        "code": "invalid_request_error"
      }
    }

This almost always means the uploaded bytes are not a readable audio file — not that the format is unsupported. Common causes:

  • Empty or zero-length file — the upload contains no bytes, or a buffer of all zeros
  • Empty WebM blob from a browser recorder: MediaRecorder produced a header with no audio frames (see “How do I record audio in the browser?” below)
  • Junk or placeholder bytes — the payload isn’t a real audio container
  • Truncated or incomplete container — the file was cut off during recording, download, or copy
  • Passing a filename string instead of a file object — use file=open("audio.wav", "rb") in Python, not file="audio.wav"

Solution: before uploading, verify the file exists, its size is greater than 0, and you’re passing a file handle/stream (not a path string).
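
The checks above can be wrapped in a small helper that runs before every upload. This is a sketch using only the Python standard library; the name `preflight_check` and the choice to inspect the first 4096 bytes are illustrative assumptions, and passing the check does not prove the file is decodable audio.

```python
import os


def preflight_check(path: str) -> None:
    """Raise if the file is missing, empty, or starts with nothing but zero bytes."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} does not exist or is not a file")
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty (0 bytes)")
    with open(path, "rb") as f:
        head = f.read(4096)
    if not any(head):  # every byte in the first block is 0x00
        raise ValueError(f"{path} starts with only zero bytes; it is probably not real audio")
```

After the check passes, open the file in binary mode and pass the handle (for example `open(path, "rb")`) rather than the path string.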

See our error handling guide for more details.

How do I record audio in the browser?

All the Node.js examples in these docs read audio with fs.createReadStream(...), which doesn’t exist in the browser. To transcribe microphone audio from a web page, record with MediaRecorder and upload the resulting blob.

The most common mistake is uploading an empty WebM blob (a container header with no audio frames), which the API rejects with "Failed to read the file, please check the audio format." The recipe below avoids that by stopping the recorder cleanly, waiting for the final dataavailable event, and checking blob.size > 0 before uploading:

    async function recordAndTranscribe(durationMs = 5000) {
      // 1. Capture the microphone
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

      // 2. Pick a supported mimeType (WebM/Opus is widely supported and accepted by the API)
      const mimeType = MediaRecorder.isTypeSupported("audio/webm;codecs=opus")
        ? "audio/webm;codecs=opus"
        : "audio/webm";

      const recorder = new MediaRecorder(stream, { mimeType });
      const chunks = [];
      recorder.ondataavailable = (event) => {
        if (event.data.size > 0) chunks.push(event.data);
      };

      // 3. Stop cleanly: the final dataavailable fires before "stop" resolves
      const stopped = new Promise((resolve) => (recorder.onstop = resolve));
      recorder.start();
      setTimeout(() => recorder.stop(), durationMs);
      await stopped;
      stream.getTracks().forEach((track) => track.stop());

      // 4. Never upload an empty recording
      const blob = new Blob(chunks, { type: mimeType });
      if (blob.size === 0) {
        throw new Error("Recording is empty: no audio frames were captured.");
      }

      // 5. Upload to the Speech-to-Text API
      // SARVAM_API_KEY is assumed to be defined elsewhere in your code.
      const formData = new FormData();
      formData.append("file", blob, "recording.webm");
      formData.append("model", "saaras:v3");
      formData.append("mode", "transcribe");

      const response = await fetch("https://api.sarvam.ai/speech-to-text", {
        method: "POST",
        headers: { "api-subscription-key": SARVAM_API_KEY },
        body: formData,
      });
      return await response.json();
    }

Pre-flight checklist before any upload:

  • The recording/file exists and size > 0
  • You’re sending the blob/file object, not a path or filename string
  • Audio longer than 30 seconds goes to the Batch API instead of the sync REST endpoint

Don’t ship your API key in client-side code. In production, upload the recording to your own backend and call the Sarvam API from there.

Tips for optimal real-time performance:

  1. Audio Settings

    const config = {
      sampleRate: 16000,
      encoding: 'LINEAR16',
      channels: 1
    }

  2. Chunk Size
    • Optimal: 100ms - 500ms chunks
    • Balance between latency and accuracy
  3. WebSocket Connection

    const ws = new WebSocket('wss://api.sarvam.ai/v1/stt/stream')
    ws.binaryType = 'arraybuffer'

  4. Error Handling

    ws.onerror = (error) => {
      console.error('WebSocket Error:', error)
      // Implement reconnection logic
    }
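
To make the chunk-size guidance concrete, the sketch below converts a chunk duration into a byte count for the audio settings listed above (16 kHz, 16-bit, mono) and slices a raw PCM buffer accordingly. The helper names are hypothetical, and how each chunk is framed on the WebSocket is not covered here.

```python
def chunk_size_bytes(chunk_ms: int, sample_rate: int = 16000,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Number of bytes of raw PCM that make up chunk_ms milliseconds of audio."""
    return sample_rate * bytes_per_sample * channels * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 200):
    """Yield consecutive slices of a raw PCM buffer, each chunk_ms long (the last may be shorter)."""
    size = chunk_size_bytes(chunk_ms)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

With these settings, 100 ms of audio is 3,200 bytes and 500 ms is 16,000 bytes, so the recommended range corresponds to sending between 3,200 and 16,000 bytes per message.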

View our real-time streaming guide for detailed examples.

Billing & Support

Usage is calculated based on:

  1. Audio Duration
    • Rounded up to the nearest second
    • Minimum charge: 1 second
  2. Features Used
    • Base transcription
    • Speaker diarization (+20%)
    • Language detection (+10%)
    • Word timestamps (+10%)
  3. Model Type
    • Saarika: Base rate
    • Saaras: Premium rate

Example calculation:

5 minutes audio × Base rate
+ Speaker diarization (20%)
+ Word timestamps (10%)
= Total cost
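
The calculation above can be sketched in code. This example assumes the feature surcharges are added together and applied to the base cost, which is how the example reads but is not stated explicitly; the per-second rate passed in is a made-up placeholder, not a real price.

```python
import math

# Surcharges listed above, as fractions of the base cost.
SURCHARGES = {
    "diarization": 0.20,
    "language_detection": 0.10,
    "word_timestamps": 0.10,
}


def estimate_cost(duration_s: float, rate_per_second: float, features=()) -> float:
    """Estimate cost: duration rounded up to whole seconds (minimum 1) times rate, plus surcharges."""
    billable_seconds = max(1, math.ceil(duration_s))
    multiplier = 1 + sum(SURCHARGES[name] for name in features)  # assumed additive
    return billable_seconds * rate_per_second * multiplier


# 5 minutes of audio with diarization and word timestamps at a placeholder rate.
print(estimate_cost(300, 0.01, ["diarization", "word_timestamps"]))
```

Under these assumptions, the example works out to 300 seconds × rate × 1.30.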

Multiple support channels available:

  1. Documentation
  2. Community
  3. Direct Support

Still Have Questions?

Can’t find what you’re looking for?

Our team is here to help! Reach out through any of our support channels.