Speech-to-Text APIs

Sarvam AI offers two powerful speech models:

API Types

Supported Audio Formats & MIME Types

The STT and STTT APIs accept over 10 major audio formats, each with one or more MIME type variants. The supported formats and MIME types are:

Format Group                 Supported MIME Types
MP3 Variants                 mpeg, mp3, mpeg3, x-mpeg-3, x-mp3
WAV Variants                 wav, x-wav, wave
AAC Variants                 aac, x-aac
AIFF Variants                aiff, x-aiff
OGG / Opus Formats           ogg, opus
FLAC Variants (Lossless)     flac, x-flac
MP4 / M4A Audio              mp4, x-m4a
AMR (Narrowband)             amr
WMA (Windows Media Audio)    x-ms-wma
WEBM (Audio & Video)         webm
PCM Formats                  pcm_s16le, pcm_l16, pcm_raw
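
As an illustration of how a client might pre-check a file against the table above before uploading, here is a minimal Python sketch. It assumes the listed values are MIME subtypes (the part after "audio/" or "video/"), which the table does not state explicitly. The names SUPPORTED_SUBTYPES and format_group are hypothetical helpers, not part of the Sarvam AI API.

```python
from typing import Optional

# Subtypes copied from the table above, keyed by a short group label.
# Assumption: each entry is the subtype of a MIME type such as "audio/x-wav".
SUPPORTED_SUBTYPES = {
    "MP3": {"mpeg", "mp3", "mpeg3", "x-mpeg-3", "x-mp3"},
    "WAV": {"wav", "x-wav", "wave"},
    "AAC": {"aac", "x-aac"},
    "AIFF": {"aiff", "x-aiff"},
    "OGG/Opus": {"ogg", "opus"},
    "FLAC": {"flac", "x-flac"},
    "MP4/M4A": {"mp4", "x-m4a"},
    "AMR": {"amr"},
    "WMA": {"x-ms-wma"},
    "WEBM": {"webm"},
    "PCM": {"pcm_s16le", "pcm_l16", "pcm_raw"},
}


def format_group(mime_type: str) -> Optional[str]:
    """Return the format group for a MIME type, or None if it is not listed."""
    # Drop parameters such as "; codecs=opus", then keep only the subtype.
    essence = mime_type.split(";", 1)[0].strip().lower()
    subtype = essence.rsplit("/", 1)[-1]
    for group, subtypes in SUPPORTED_SUBTYPES.items():
        if subtype in subtypes:
            return group
    return None
```

A caller could reject an upload early when format_group returns None, instead of waiting for the API to refuse it.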

For most audio formats, our API detects the codec automatically. For PCM formats (pcm_s16le, pcm_l16, pcm_raw), however, you must explicitly set the input_audio_codec parameter. PCM files are supported only at a 16 kHz sample rate.
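
The PCM rule can be sketched as a small client-side check. This is a minimal Python sketch, not the official SDK: the function build_codec_params and its sample_rate_hz argument are hypothetical, and only the input_audio_codec parameter name and the three PCM values come from the text above. How the resulting field is attached to a request (form field, query parameter, and so on) is not covered here.

```python
from typing import Dict, Optional

# PCM codec names and the single supported sample rate, as stated above.
PCM_CODECS = {"pcm_s16le", "pcm_l16", "pcm_raw"}
PCM_SAMPLE_RATE_HZ = 16_000


def build_codec_params(codec: Optional[str],
                       sample_rate_hz: Optional[int] = None) -> Dict[str, str]:
    """Return the extra request fields needed for the given codec.

    Non-PCM input needs no extra field because the codec is auto-detected.
    PCM input must name its codec and must be sampled at 16 kHz.
    """
    if codec is None or codec not in PCM_CODECS:
        return {}  # rely on automatic codec detection
    if sample_rate_hz != PCM_SAMPLE_RATE_HZ:
        raise ValueError(
            f"PCM audio must be {PCM_SAMPLE_RATE_HZ} Hz, got {sample_rate_hz}"
        )
    return {"input_audio_codec": codec}
```

Validating the sample rate before sending the request surfaces the 16 kHz restriction as a local error rather than a failed API call.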


Technical Capabilities

Language Support
  • 10+ Indian languages and English
  • Automatic language detection
  • Code-mixing support
  • Multi-speaker handling

Advanced Processing
  • Speaker diarization (Batch API)
  • Timestamp generation
  • Entity preservation
  • Telephony optimization

Next Steps

1. Choose Your API
   Select the appropriate API type based on your use case.

2. Get API Key
   Sign up and get your API key from the dashboard.

3. Go Live
   Deploy your integration and monitor usage in the dashboard.

Need help choosing the right API? Contact us on Discord for guidance.