Frequently Asked Questions

Find answers to common questions about our speech-to-text services

General Questions

The REST and Batch APIs support a wide range of audio formats, including:

  • WAV
  • MP3
  • M4A
  • AAC
  • OGG
  • FLAC
  • WebM
  • PCM (pcm_s16le, pcm_l16, pcm_raw)

The WebSocket/Streaming APIs support only:

  • WAV
  • Raw PCM (pcm_s16le, pcm_l16, pcm_raw)

For optimal results, we recommend:

  • Sample rate: 16 kHz or higher
  • Bit depth: 16-bit
  • Channels: Mono or Stereo
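The recommendations above can be checked locally before uploading. The following is a minimal sketch that uses only Python's standard `wave` module. The helper name `check_wav` is hypothetical and not part of any SDK, and it only handles WAV files.

```python
import wave


def check_wav(path):
    """Return a list of warnings for a WAV file, based on the recommended settings."""
    warnings = []
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()

    if sample_rate < 16000:
        warnings.append(f"sample rate is {sample_rate} Hz, below 16 kHz")
    if bit_depth != 16:
        warnings.append(f"bit depth is {bit_depth}-bit, not 16-bit")
    if channels not in (1, 2):
        warnings.append(f"{channels} channels, expected mono or stereo")
    return warnings
```

An empty list means the file matches the recommended settings. Any other result lists what to resample or convert.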

Our models support multiple Indian and global languages:

Indian Languages

  • Hindi
  • English (Indian)
  • Bengali
  • Tamil
  • Telugu
  • Kannada
  • Malayalam
  • Marathi
  • Gujarati
  • Punjabi

Global Languages

  • English (US, UK, AU)
  • French
  • German
  • Spanish
  • Japanese

Check our models page for the complete list and specific model capabilities.

File size and duration limits vary by API endpoint:

REST API

  • Maximum file size: 1 GB
  • Maximum duration: 4 hours

WebSocket API (Streaming)

  • No file size limit
  • Maximum continuous stream duration: 8 hours

For longer audio files, we recommend:

  1. Splitting into smaller segments
  2. Using batch processing
  3. Contacting support for custom solutions
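As a rough illustration of the first option, the sketch below plans segment boundaries for a recording that exceeds the 4-hour REST limit. The function name `plan_segments` is hypothetical. It only computes time windows and leaves the actual audio cutting to whatever tool you already use.

```python
def plan_segments(total_seconds, max_segment_seconds=4 * 60 * 60):
    """Split a recording length into (start, end) windows no longer than the limit."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")

    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


# A 10-hour recording becomes two full 4-hour windows and one 2-hour window.
print(plan_segments(10 * 60 * 60))
# [(0, 14400), (14400, 28800), (28800, 36000)]
```

Fixed offsets can cut a word in half, so in practice you may prefer to move each boundary to a nearby pause in the speech.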

Accuracy varies based on several factors:

Typical Accuracy Rates

  • Clear speech, minimal background noise: 95-98%
  • Multiple speakers, moderate noise: 90-95%
  • Heavy accent or background noise: 85-90%

Factors affecting accuracy:

  • Audio quality
  • Background noise
  • Speaker accent
  • Speaking speed
  • Domain-specific terminology

Use our playground to test with your specific audio.

Technical Questions

Speaker diarization identifies and labels different speakers in the audio:

  1. Process:

    • Voice activity detection
    • Speaker segmentation
    • Speaker clustering
    • Speaker labeling
  2. Usage (via Batch API):

    from sarvamai import SarvamAI

    client = SarvamAI(api_subscription_key="YOUR_SARVAM_API_KEY")

    # Speaker diarization is available through the Batch API
    # See: https://docs.sarvam.ai/api-reference-docs/speech-to-text/batch
    response = client.speech_to_text.batch_transcribe(
        file=open("audio.mp3", "rb"),
        model="saaras:v3",
        mode="transcribe",
        with_diarization=True
    )

  3. Output:

    {
      "segments": [
        {
          "speaker": "Speaker 1",
          "text": "Hello, how are you?",
          "start": 0.0,
          "end": 1.5
        },
        {
          "speaker": "Speaker 2",
          "text": "I'm doing well, thanks!",
          "start": 1.8,
          "end": 3.2
        }
      ]
    }
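
Once you have diarized output, a common next step is to aggregate it per speaker. The sketch below assumes the response has exactly the shape shown in the sample output (a `segments` list with `speaker`, `start`, and `end` fields). The helper name `talk_time_by_speaker` is hypothetical.

```python
import json
from collections import defaultdict


def talk_time_by_speaker(response_json):
    """Sum the duration of each speaker's segments, in seconds."""
    data = json.loads(response_json)
    totals = defaultdict(float)
    for segment in data["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


sample = """
{
  "segments": [
    {"speaker": "Speaker 1", "text": "Hello, how are you?", "start": 0.0, "end": 1.5},
    {"speaker": "Speaker 2", "text": "I'm doing well, thanks!", "start": 1.8, "end": 3.2}
  ]
}
"""

for speaker, seconds in talk_time_by_speaker(sample).items():
    print(f"{speaker}: {seconds:.1f}s")
```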

Rate limits are applied per account based on your subscription plan:

Plan          Rate Limit
Starter       60 requests/min
Pro           200 requests/min
Business      1,000 requests/min
Enterprise    Custom limits

File Size Limits

  • REST API: maximum file size 1 GB, maximum duration 4 hours
  • Streaming API: no file size limit, up to 8 hours of continuous streaming

For batch endpoints, wait at least 5 ms between status polling requests.
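
A polling loop that respects this delay might look like the sketch below. It is deliberately SDK-agnostic: `get_status` is a hypothetical callable you supply, and the status strings `"completed"` and `"failed"` are assumptions, not documented values.

```python
import time


def poll_until_done(get_status, min_delay_seconds=0.005, timeout_seconds=600.0):
    """Call get_status() repeatedly, pausing between calls, until the job finishes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        # Assumed terminal states; adjust to the values your job API returns.
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(min_delay_seconds)
```

The 5 ms default is only the stated minimum. A longer delay, such as a few seconds, reduces request volume and helps you stay within your plan's rate limit.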

View the full Credits & Rate Limits page for details on HTTP headers, error handling, and upgrade paths.

Common errors and solutions:

1. Authentication Errors (401)

    {
      "error": "invalid_api_key",
      "message": "API key is invalid or expired"
    }

Solution: Check that your API key is valid and configured correctly.

2. Rate Limit Errors (429)

    {
      "error": "rate_limit_exceeded",
      "message": "Rate limit exceeded",
      "retry_after": 3600
    }

Solution: Implement exponential backoff or upgrade your plan.
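
Exponential backoff can be sketched as follows. This is an illustration, not the SDK's behavior: `RateLimitError` is a hypothetical exception standing in for however your HTTP client surfaces a 429, and treating `retry_after` as a number of seconds is an assumption.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limit exceeded")
        self.retry_after = retry_after


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Retry request() on RateLimitError, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Assumption: retry_after is in seconds. If present, wait at least that long.
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Wrap any request in a zero-argument callable, for example `call_with_backoff(lambda: send_request(payload))`, where `send_request` is your own function.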

3. Invalid Input (400)

    {
      "error": "invalid_input",
      "message": "Unsupported audio format"
    }

Solution: Check that your audio meets the supported formats and requirements.

See our error handling guide for more details.

Tips for optimal real-time performance:

  1. Audio Settings

    const config = {
      sampleRate: 16000,
      encoding: 'LINEAR16',
      channels: 1
    }

  2. Chunk Size

    • Optimal: 100-500 ms chunks
    • Balance between latency and accuracy
  3. WebSocket Connection

    const ws = new WebSocket('wss://api.sarvam.ai/v1/stt/stream')
    ws.binaryType = 'arraybuffer'

  4. Error Handling

    ws.onerror = (error) => {
      console.error('WebSocket Error:', error)
      // Implement reconnection logic
    }

View our real-time guide for detailed examples.
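
To make the chunk-size guidance concrete, the sketch below converts a chunk duration into a byte count for raw PCM and slices a buffer accordingly. It assumes 16-bit samples (2 bytes each), matching the audio settings above. Both function names are hypothetical.

```python
def chunk_size_bytes(sample_rate=16000, bytes_per_sample=2, channels=1, chunk_ms=100):
    """Number of raw PCM bytes that make up chunk_ms milliseconds of audio."""
    return sample_rate * bytes_per_sample * channels * chunk_ms // 1000


def iter_chunks(pcm_bytes, chunk_bytes):
    """Yield consecutive slices of pcm_bytes; the last one may be shorter."""
    for offset in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[offset:offset + chunk_bytes]


print(chunk_size_bytes(chunk_ms=100))  # 3200
print(chunk_size_bytes(chunk_ms=500))  # 16000
```

At 16 kHz, 16-bit mono, the recommended range works out to between 3,200 and 16,000 bytes per message.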

Billing & Support

Usage is calculated based on:

  1. Audio Duration

    • Rounded up to the nearest second
    • Minimum charge: 1 second
  2. Features Used

    • Base transcription
    • Speaker diarization (+20%)
    • Language detection (+10%)
    • Word timestamps (+10%)
  3. Model Type

    • Saarika: Base rate
    • Saaras: Premium rate

Example calculation:

5 minutes audio × Base rate
+ Speaker diarization (20%)
+ Word timestamps (10%)
= Total cost
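
The same calculation can be sketched in code. Two things here are assumptions rather than documented behavior: feature surcharges are treated as additive percentages of the base charge, and the base rate of 1.0 per second is a placeholder, so the result is in relative units rather than an actual price.

```python
import math

# Surcharge percentages taken from the feature list above.
SURCHARGE_PERCENT = {
    "diarization": 20,
    "language_detection": 10,
    "word_timestamps": 10,
}


def estimate_cost(duration_seconds, base_rate_per_second, features=()):
    """Estimate cost: duration rounded up to whole seconds, minimum 1 second."""
    billable_seconds = max(1, math.ceil(duration_seconds))
    surcharge = sum(SURCHARGE_PERCENT[name] for name in features)
    return billable_seconds * base_rate_per_second * (100 + surcharge) / 100


# 5 minutes of audio with speaker diarization and word timestamps.
print(estimate_cost(5 * 60, 1.0, ["diarization", "word_timestamps"]))  # 390.0
```

With a placeholder rate of 1.0, the result of 390.0 reads as 300 billable seconds plus a 30% surcharge.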

Multiple support channels are available:

  1. Documentation
  2. Community
  3. Direct Support

Still Have Questions?

Can’t find what you’re looking for?

Our team is here to help! Reach out through any of our support channels.