Speech-to-Text APIs
Sarvam AI offers two speech models:
- Speech-to-Text (STT): a high-accuracy ASR model that transcribes Indian-language speech in the language spoken, across diverse audio conditions.
- Speech-to-Text-Translate (STTT): a high-accuracy ASR model that auto-detects the input language and produces English-translated transcripts of Indian-language speech.
API Types
- Synchronous: processing for files under 30 seconds.
- Asynchronous (batch): processing for files up to 1 hour.
- Streaming: real-time audio streaming with instant results.
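The duration limits above can be turned into a small routing helper. The following is only a sketch: `ApiType` and `choose_api_type` are hypothetical names invented for illustration and are not part of any Sarvam AI SDK. It also assumes that "under 30 seconds" is a strict bound and that "up to 1 hour" includes exactly one hour.

```python
from enum import Enum


class ApiType(Enum):
    """Hypothetical labels for the three API types described above."""
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"
    STREAMING = "streaming"


# Limits taken from the descriptions above.
SYNC_LIMIT_SECONDS = 30          # synchronous: files under 30 seconds
ASYNC_LIMIT_SECONDS = 60 * 60    # asynchronous: files up to 1 hour


def choose_api_type(duration_seconds: float, realtime: bool = False) -> ApiType:
    """Pick an API type from the audio duration (illustrative helper only)."""
    if realtime:
        # Live audio goes to streaming regardless of duration.
        return ApiType.STREAMING
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    if duration_seconds < SYNC_LIMIT_SECONDS:
        return ApiType.SYNCHRONOUS
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return ApiType.ASYNCHRONOUS
    raise ValueError("audio longer than 1 hour must be split before processing")
```

For example, `choose_api_type(12.5)` selects the synchronous path, while a 10-minute recording (`choose_api_type(600)`) falls through to the asynchronous one.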
Supported Audio Formats & MIME Types
The STT and STTT APIs support more than 10 major audio formats and their MIME type variants.
For most audio formats, our API automatically detects the codec. However, when
using PCM formats (pcm_s16le, pcm_l16, pcm_raw), you must explicitly
specify the input_audio_codec parameter. PCM files are supported only at
a 16 kHz sample rate.
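As a rough illustration of the PCM rule, the sketch below builds the codec-related request fields before an upload. Only the parameter name input_audio_codec and the three PCM codec values come from this page. The helper `build_codec_fields` is hypothetical, and how the resulting fields are attached to an actual request is an assumption left to the caller.

```python
from typing import Dict, Optional

# PCM codec values named in the note above.
PCM_CODECS = frozenset({"pcm_s16le", "pcm_l16", "pcm_raw"})
PCM_SAMPLE_RATE_HZ = 16_000  # PCM is supported only at 16 kHz


def build_codec_fields(
    codec: Optional[str] = None,
    sample_rate_hz: Optional[int] = None,
) -> Dict[str, str]:
    """Return the extra request fields needed for the given codec.

    Hypothetical helper: non-PCM audio returns no extra fields because
    the API detects the codec automatically.
    """
    if codec is None or codec not in PCM_CODECS:
        # Rely on automatic codec detection.
        return {}
    if sample_rate_hz != PCM_SAMPLE_RATE_HZ:
        raise ValueError(
            f"PCM audio must be {PCM_SAMPLE_RATE_HZ} Hz, got {sample_rate_hz}"
        )
    # PCM input: the codec must be stated explicitly.
    return {"input_audio_codec": codec}
```

Calling `build_codec_fields("pcm_s16le", 16000)` yields the explicit codec field, whereas calling it with no arguments yields an empty dictionary and leaves detection to the API.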
Technical Capabilities
- 10+ Indian languages and English
- Automatic language detection
- Code-mixing support
- Multi-speaker handling
- Speaker diarization (Batch API)
- Timestamp generation
- Entity preservation
- Telephony optimization
Next Steps
Need help choosing the right API? Contact us on Discord for guidance.