Frequently Asked Questions
Find answers to common questions about our speech-to-text services
General Questions
What audio formats are supported?
The REST and Batch APIs support a wide range of audio formats, including:
- WAV
- MP3
- M4A
- AAC
- OGG
- FLAC
- WebM
- PCM (pcm_s16le, pcm_l16, pcm_raw)
The WebSocket (Streaming) API supports only:
- WAV
- Raw PCM (pcm_s16le, pcm_l16, pcm_raw)
For optimal results, we recommend:
- Sample rate: 16 kHz or higher
- Bit depth: 16-bit
- Channels: Mono or Stereo
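As a quick local check against these recommendations, the sketch below inspects a WAV file with Python's standard `wave` module. The function name `check_wav` and the warning texts are illustrative, not part of any SDK, and the check only applies to WAV input (other containers raise `wave.Error`).

```python
import wave


def check_wav(path):
    """Return a list of warnings for a WAV file, based on the recommended settings."""
    warnings = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        frames = wav.getnframes()

    if frames == 0:
        warnings.append("file contains no audio frames")
    if sample_rate < 16000:
        warnings.append(f"sample rate {sample_rate} Hz is below 16 kHz")
    if bit_depth != 16:
        warnings.append(f"bit depth is {bit_depth}-bit, 16-bit is recommended")
    if channels not in (1, 2):
        warnings.append(f"{channels} channels found, mono or stereo is recommended")
    return warnings
```

An empty list means the file matches the recommended settings; it says nothing about whether the audio content itself is clean.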
What languages are supported?
Our models support multiple Indian and global languages:
Indian Languages
- Hindi
- English (Indian)
- Bengali
- Tamil
- Telugu
- Kannada
- Malayalam
- Marathi
- Gujarati
- Punjabi
Global Languages
- English (US, UK, AU)
- French
- German
- Spanish
- Japanese
Check our models page for the complete list and specific model capabilities.
What is the maximum duration?
The limits vary by API endpoint:
REST API
- Maximum duration: 30 seconds per request
Batch API
- Maximum duration: 2 hours per file
- Maximum files per job: 20
WebSocket API (Streaming)
- Continuous streaming with chunked audio — no duration limit
- Concurrency limits apply per plan (see Rate Limits)
For audio longer than 30 seconds, use the Batch API. For files longer than 2 hours, we recommend:
- Splitting into smaller segments
- Contacting support for custom solutions
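The routing rule above can be sketched as a small helper. The function names and return labels are hypothetical, and treating exactly 30 seconds and exactly 2 hours as within the limits is an assumption.

```python
REST_MAX_SECONDS = 30             # REST API limit per request
BATCH_MAX_SECONDS = 2 * 60 * 60   # Batch API limit per file (2 hours)


def choose_endpoint(duration_seconds):
    """Pick a processing route for a recording of the given length."""
    if duration_seconds <= REST_MAX_SECONDS:   # assumption: the limit is inclusive
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:  # assumption: the limit is inclusive
        return "batch"
    return "split"  # too long for one Batch file: split it or contact support


def split_points(duration_seconds, max_segment_seconds=BATCH_MAX_SECONDS):
    """Return (start, end) pairs, in seconds, that cover the whole recording."""
    segments = []
    start = 0
    while start < duration_seconds:
        end = min(start + max_segment_seconds, duration_seconds)
        segments.append((start, end))
        start = end
    return segments
```

For a 2.5-hour recording, `split_points(9000)` yields two segments, each short enough for one Batch file. Cutting at fixed offsets can split a word in half, so in practice you may prefer to cut at silences.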
How accurate is the transcription?
Accuracy varies based on several factors:
Typical Accuracy Rates
- Clear speech, minimal background noise: 95-98%
- Multiple speakers, moderate noise: 90-95%
- Heavy accent or background noise: 85-90%
Factors affecting accuracy:
- Audio quality
- Background noise
- Speaker accent
- Speaking speed
- Domain-specific terminology
Use our interactive API reference to test with your specific audio.
Technical Questions
How does speaker diarization work?
Speaker diarization identifies and labels the different speakers in the audio.
Process:
- Voice activity detection
- Speaker segmentation
- Speaker clustering
- Speaker labeling
Diarization is used via the Batch API.
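As a rough illustration of working with labeled segments, the sketch below merges consecutive segments from the same speaker into turns. The segment fields (`speaker_id`, `start`, `end`, `text`) are assumed for illustration only and are not the documented response schema.

```python
def merge_turns(segments):
    """Merge consecutive segments spoken by the same speaker into single turns.

    Each segment is a dict with the hypothetical keys
    'speaker_id', 'start', 'end' and 'text'.
    """
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker_id"] == seg["speaker_id"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(dict(seg))  # copy so the input list is not modified
    return turns
```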
What are the rate limits?
Rate limits are applied per account based on your subscription plan:
Duration Limits
- REST API: Max 30 seconds of audio per request
- Batch API: Up to 2 hours per file, 20 files per job
- Streaming API: Continuous (chunked) streaming; concurrency limits per plan
For batch endpoints, implement a minimum 5ms delay between status polling requests.
View the full Credits & Rate Limits page for details on HTTP headers, error handling, and upgrade paths.
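The polling delay can be enforced with a small loop like the one below. This is a sketch: `get_status` is a placeholder for whatever call returns your job's status, and the state names `"Completed"` and `"Failed"` are assumptions, so substitute the values the Batch API actually returns.

```python
import time


def poll_until_done(get_status, min_delay_seconds=0.005, timeout_seconds=3600.0,
                    terminal_states=("Completed", "Failed")):
    """Call get_status() repeatedly, pausing between calls, until the job finishes.

    min_delay_seconds defaults to the 5 ms minimum delay between polls;
    a longer delay is usually kinder to your rate limit.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(min_delay_seconds)
```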
How do I handle errors?
Common errors and solutions:
1. Authentication Errors (403)
Solution: Check API key validity and proper configuration. Note: Sarvam returns HTTP 403 (not 401) for invalid/missing API keys — see the Authentication page.
2. Rate Limit / Quota Errors (429)
Solution: Implement exponential backoff or upgrade plan. A 429 with rate_limit_exceeded_error means too many requests; insufficient_quota_error means credits are exhausted — see Errors & Troubleshooting.
3. Invalid Input (400)
Solution: Check that the audio format and request parameters meet the documented requirements.
4. Failed to read the file (400)
This almost always means the uploaded bytes are not a readable audio file — not that the format is unsupported. Common causes:
- Empty or zero-length file: the upload contains no bytes, or a buffer of all zeros
- Empty WebM blob from a browser recorder: MediaRecorder produced a header with no audio frames (see "How do I record audio in the browser and transcribe it?" below)
- Junk or placeholder bytes: the payload isn't a real audio container
- Truncated or incomplete container: the file was cut off during recording, download, or copy
- Passing a filename string instead of a file object: use file=open("audio.wav", "rb") in Python, not file="audio.wav"
Solution: before uploading, verify the file exists, its size is greater than 0, and you’re passing a file handle/stream (not a path string).
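A minimal sketch of that pre-upload check in Python follows. The helper name `open_audio_for_upload` is hypothetical; it only guards against missing and zero-length files, and returns a binary file handle suitable for passing as `file=`.

```python
import os


def open_audio_for_upload(path):
    """Verify the file exists and is non-empty, then return a binary file handle."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"audio file not found: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"audio file is empty (0 bytes): {path}")
    return open(path, "rb")  # a file object, not the path string
```

Remember to close the handle after the upload, for example with `with open_audio_for_upload("audio.wav") as f: ...`. This check does not detect truncated containers or junk bytes; those still surface as a 400 from the API.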
See our error handling guide for more details.
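For 429 responses caused by too many requests, exponential backoff might look like the sketch below. `RateLimitError` is a hypothetical stand-in for whatever exception your HTTP client or SDK raises on a 429, and the delay values are illustrative. Retrying does not help when the 429 is an insufficient_quota_error, since credits will not replenish on their own.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Run call(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use, leave it at the default.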
How do I record audio in the browser and transcribe it?
All the Node.js examples in these docs read audio with fs.createReadStream(...), which doesn’t exist in the browser. To transcribe microphone audio from a web page, record with MediaRecorder and upload the resulting blob.
The most common mistake is uploading an empty WebM blob (a container header with no audio frames), which the API rejects with "Failed to read the file, please check the audio format." To avoid that, stop the recorder cleanly, wait for the final dataavailable event, and check blob.size > 0 before uploading.
Pre-flight checklist before any upload:
- The recording/file exists and size > 0
- You're sending the blob/file object, not a path or filename string
- Audio longer than 30 seconds goes to the Batch API instead of the sync REST endpoint
Don’t ship your API key in client-side code. In production, upload the recording to your own backend and call the Sarvam API from there.
How do I optimize for real-time transcription?
Tips for optimal real-time performance:
- Audio Settings
- Chunk Size
  - Optimal: 100 ms to 500 ms chunks
  - Balance between latency and accuracy
- WebSocket Connection
- Error Handling
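To turn the chunk-duration guidance into byte counts for raw PCM, the arithmetic can be sketched as follows. The defaults (16 kHz, 16-bit, mono) are assumptions taken from the recommended audio settings, and the function names are illustrative.

```python
def chunk_size_bytes(chunk_ms, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Number of bytes in chunk_ms milliseconds of raw PCM audio."""
    frames = sample_rate * chunk_ms // 1000      # whole frames in the chunk
    return frames * bytes_per_sample * channels  # always aligned to a frame boundary


def iter_chunks(pcm, chunk_ms=200, **audio_format):
    """Yield successive chunks of a raw PCM byte string; the last may be shorter."""
    size = chunk_size_bytes(chunk_ms, **audio_format)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

With the defaults, a 100 ms chunk is 3,200 bytes and a 500 ms chunk is 16,000 bytes. Smaller chunks lower latency but mean more messages per second of audio.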
View our real-time streaming guide for detailed examples.
Billing & Support
How is usage calculated?
Usage is calculated based on:
- Audio Duration
  - Rounded up to the nearest second
  - Minimum charge: 1 second
- Features Used
  - Base transcription
  - Speaker diarization (+20%)
  - Language detection (+10%)
  - Word timestamps (+10%)
- Model Type
  - Saarika: Base rate
  - Saaras: Premium rate
Example calculation:
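A minimal sketch of the arithmetic, assuming the feature surcharges add together (the list above does not say whether they add or compound). The result is expressed in base-rate seconds, so multiply it by your model's per-second rate to get a cost. The feature keys are illustrative names, not API parameters.

```python
import math

# Surcharges from the list above, in percent of the base rate.
FEATURE_SURCHARGE_PERCENT = {
    "diarization": 20,
    "language_detection": 10,
    "word_timestamps": 10,
}


def billable_seconds(duration_seconds):
    """Round up to the nearest second, with a 1-second minimum."""
    return max(1, math.ceil(duration_seconds))


def usage_units(duration_seconds, features=()):
    """Billable seconds scaled by feature surcharges (assumed to be additive)."""
    percent = 100 + sum(FEATURE_SURCHARGE_PERCENT[name] for name in features)
    return billable_seconds(duration_seconds) * percent / 100
```

Under these assumptions, a 61.2-second file with diarization and word timestamps is billed as 62 seconds × 1.30 = 80.6 base-rate seconds.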
How do I get support?
Multiple support channels are available:
- Documentation
- Community
- Direct Support
  - Email: developer@sarvam.ai
  - Enterprise: Dedicated support manager